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We study random walk based algorithms for posterior simulation 
in a large of class of Bayesian nonparametric problems with Gaussian 
random field or Gaussian process priors. Our emphasis is both on de- 
veloping practical guidelines for the design and implementation of ef- 
ficient algorithms for these naturally high dimensional problems, and 
on the development of rigorous underpinning theory. We emphasize 
the principle of 'optimal design' of MCMC methods on the underlying 
infinite dimensional space in Bayesian nonparametric problems and 
show that this design principle leads to vastly improved, and novel, 
algorithms in finite dimensions. 

To illustrate this idea we demonstrate that a small change in the 
mean of the proposal in a random walk algorithm can result in sig- 
nificant improvement in computational complexity as the dimension 
of the nonparametric approximation is refined. In particular we com- 
pare the computational complexity of this resulting new algorithm, 
pCN, with a standard random walk algorithm, RWM. RWM is opti- 
mized with respect to proposal vanance, as suggested by the prevail- 
ing optimal scaling theory. We show that pCN exhibits an order of 
magnitude improvement in computational complexity over RWM, in 
terms of the dimension of the space. 

Our methods of analysis rely on diffusion limits for the MCMC 
methods. This approach also enables us to develop a theory of simu- 
lated annealing for posterior simulation in an high (and infinite) di- 
mensional settings, and shows how a noisy gradient descent algorithm 
can emerge, without explicitly computing the gradient, from certain 
carefully specified random walks; this results from the Metropolis- 
Hastings accept-reject mechanism. This theory extends results known 
in finite dimensions for finding local maxima using a gradient flow and 
is hence of independent interest. 

1. Introduction. 

1.1. The Context. Bayesian nonparametrics have witnessed a rapid growth in the last decade, 
both in the construction of novel prior distributions [DPP07, WCTll] and obtaining results on 
posterior consistency and sharp convergence rates, e.g., [GVDVll, VDVVZ08]. Nonparametric 
methods are increasingly popular in many applications due to their flexibility to model a wide vari- 
ety of statistical problems when the parameter is naturally infinite dimensional. In such problems, 
the posterior distribution usually does not have a closed from solution and therefore one needs to 
resort to computational methods, such as the Markov Chain Monte Carlo (MCMC) algorithms, to 
get samples from the posterior distribution. In this regard, a wide array of fast computational algo- 
rithms have been developed for obtaining posterior draws [RC04, PROS, IZOO]. Here we focus on the 
optimal design of proposals for target measures arising in Bayesian nonparametrics with Gaussian 
random field priors. Gaussian random fields are ubiquitous in nonparametric modeling becuase of 
their remarkably rich sample properties, and the fact that they can be characterized entirely by the 
mean function and covariance operator [RW06]. In particular they arise in the Bayesian approach 
to inverse problems [Fit91, StulO] and in the study of conditioned diffusion processes [HSVIO]. 
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Since the parameter space is infinite dimensional in these problems, practical implementation of 
MCMC involves discretizing the parameter space, resulting in a target measure in M^, with iV S> 1. 
It is well known that such discretization schemes can suffer from the curse of dimensionality: the 
efficiency of the algorithm decreases as the dimension of the discretized space grows large. The 
central message of this paper is the following: 

Optimal Proposal Design Principle. Designing proposals which are well-defined on the infinite 
dimensional parameter space results in MCMC methods which do not suffer from the curse of 
dimensionality. 

In this work we will substantiate the above idea with a specific example consisting of two near- 
identical looking versions of the Random Walk Metropolis algorithm, namely the standard random 
walk (RWM) and a minor variant of it which we call the pCN algorithm (for preconditioned Crank- 
Nicolson, explained below), in the very general setting of Bayesian nonparametrics with Gaussian 
random field priors. The standard random walk proposal is not defined on the infinite dimensional 
parameter space and hence does not adhere to the optimal design principle. The prevailing tool cur- 
rently available to practitioners for implementing the RWM is the optimal scaling result [RGG97], 
which suggests optimal tuning of the size of the proposal variance. On the other hand the pCN 
algorithm is defined on the infinite dimensional space, and the optimal proposal design principle 
applied to this algorithm suggests tuning the proposal mean instead of the variance. Even though 
the construction of the above two algorithms are nearly identical, and thus implementing pCN 
over RWM requires only a minor change in implementation, it turns out that the pCN algorithm 
enjoys an order of magnitude in efficiency gain as compared to the RWM. We will rigorously show 
that the optimal tuning of the size of the proposal variance for RWM as suggested by the optimal 
scaling results of [RGG97] has a negligible effect on computational complexity in comparison with 
the effect obtained from following our optimal proposal design principle, which tunes the proposal 
mean resulting in the pCN algorithm. 

Before proceeding further, we would like to emphasize again that one of the key goals of our 
work is to convey the optimal design principle to practitioners. In particular, our exclusive focus 
on the RWM algorithm in this paper is made to demonstrate the efficacy of the optimal design 
principle in the simplest nontrivial example. We discuss this issue at the end of subsection 1.3 and 
in subsection 1.8. 

1.2. The Target and The Algorithms. First we describe the class of models to which our main 
results are applicable. We denote the prior and posterior distributions respectively by ttq and tt 
and consider models in which both ttq and n are measures on a Hilbert space i'H,{-,-),\\ ■ || ). 



Furthermore ttq = N(0, C) is assumed to be a Gaussian random field on T-L and C is a covariance 
operator. The posterior vr is given by the identity 



for a real valued functional ^ (which denotes the negative log-likelihood) and Mij, a normalizing 
constant. Although the above formulation may appear quite abstract, we emphasize that this points 
to the wide-ranging applicability of our theory: the setting encompasses a large class of models 
arising in practice, incuding nonparametric regression and diffusion processes [HSVIO, StulO]. In 
Section 2.4 we discuss a few commonly used statistical models which belong to our framework. 

Next we describe the class of MCMC algorithms which are our focus. Underlying the family 
of proposals we adopt is the assumption that it is straightforward to sample from the Gaussian 
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random field N(0, C). The proposals we consider are hence of the form 

(1.2) y = ax + V26(, 

where ^ ~ N(0, C) is chosen independently of x, (5 > sets the scale for the proposal variance 
and o G M scales the proposal mean. Two instances of the algorithm are of particular interest: the 
standard random walk (RWM) 

y = x + V26^ (1.3) 
and the preconditioned Crank-Nicolson walk (pCN) 

y = {1 - 26)^ + V26 (1.4) 

The preconditioned random walk (pCN) proposal is introduced in [BRSV08] as the PIA algo- 
rithm, with the parameter choices ^ = | and a = 0; the nomenclature used here, namely the 
preconditioned Crank-Nicolson algorithm - pCN algorithm - is introduced and put in historical 
context in [CRSWll]. To implement the algorithms in practice a finite dimensional approximation 
of the target vr is used. In our set-up, a natural way to perform the discretization is to consider the 
A'^-dimensional space spanned by the first N eigenfunctions of C, <Pj,l < j < N, project the target 
density into this basis and let the Markov chain evolve in M^. This is easy to implement as well, 
by setting in ^ ~ N(0, C^) in (1.2), where the diagonal elements of the covariance matrix are 
the first N eigenvalues of C and the off-diagonal elements are 0. 

Once this finite dimensional target is set-up, a key question for the practitioner is how to choose 
the free parameters a and 6 in the proposal? In the context of the family of proposals (1.2), most 
of the practical guidance provided by statistical theory to date is focussed on the RWM proposal 

(1.3) and optimal tuning of the proposal variance 6. We will show that this has negligible effect on 
computational complexity when compared with choosing the proposal mean scale a. In particular 
the choice a{S) = (1 — 26)2 appearing in the pCN proposal (1.4) gives the optimal proposal design 
within the family of proposals (1.2). 

1.3. Implications of Theory. Suppose we are interested in computing the expectation of test 
function / : — )• M using MCMC methods based on the family of proposals (1.2). We assume 
that / is bounded with bounded (Frechet) derivative. We let {x'''^}k^j^+ denote the Markov chain 
resulting from using the proposal (1.2) on the finite dimensional approximation of vr in and 
applying the Metropolis-Hastings accept-reject criterion. 

Consider the following estimator for the expectation of / under the measure vr given by 

K-l 

which is the just the empirical mean of the Markov chain output. We let C{N) denote the cost of 
evaluating one step of the Markov chain with proposal (1.2), including evaluation of the acceptance 
probability; this cost grows with dimension N, but is independent of a and 6. We ask the question: 
what is the computational cost, in terms of N, to ensure that fN,a,s is within a tolerance level e 
of E'^/. We let E'^ denote expectation with respect to n for fixed starting point in the Markov 
chain, and let E'^'* denote expectation with respect to tt for x^ in stationarity, i.e., x^ ~ vr. The 
following theorem is precisely stated and proved as Theorem 4.1. 
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Theorem 1: Optimal Proposal Mean Consider the MCMC method based on proposal (1.4) 
with optimally tuned mean a = (1 — 2(5)5. Then, for any e > and any in the support of tt, it 
is possible to choose K independent of N to ensure that 

|lE"/7V,a,5-IEV| <e, 
and the resulting computational cost grows with N as C{N). 

This theorem about computational cost, and the effect of scaling the mean parameter, will be a 
consequence of the theory we develop in this paper. For comparison it is instructive to see what 
optimal variance scaling suggested by the optimal scaling results for RWM [RGG97] tells us. This is 
the prevailing statistical theory which is adopted by practitioners to guide choice of proposals. The 
following assertion is stated more precisely and proved as Theorem 4.3, and is a direct consequence 
of recent theories which extend the work of [RGG97] to targets of the form (1.1) [MPSll]: 

Theorem 2 (for Comparison): Optimal Proposal Variance Consider the MCMC method 
based on proposal (1.3) with optimally tuned variance 6 = £ x N^'^ , to obtain acceptance probability 
0.234, and K = KqN'^. Then, for any e > and ~ it, it is possible to choose Kq independent 
of N so that 

|lE"'Vjv,a,5-IE"/| <e 
and the resulting computational cost grows with N as N x C{N). 

It is worth pausing to consider the implications of these two theorems for practical MCMC 
computations. In Bayesian inverse problems (especially in applied problems pertaining to ordinary 
and partial differential equations, see [StulO] and the references there in) it is increasingly common 
to work with discretizations of size = 0(10^) or even larger. The effect of focusing on the optimal 
proposal design principal that emerges from the theory in this paper results in an 0{N) speed-up 
when compared with adopting the prevailing approach of tuning proposal variance. In practice this 
may mean the difference between running computations on a desktop rather than a multiprocessor 
computer, or between being able to evaluate desired expectations in reasonable computer time and 
not being able to do so, for example in on-line scenarios. Note also that the theory developed for the 
pCN variant of the RWM algorithm based on the optimal proposal design principle in this paper 
holds for any initial condition and is hence far more powerful than the theory based on tuning 
proposal variance, which works only in stationarity. 

One of the celebrated aspects of the optimal proposal variance approach, widely cited and used 
by applied workers in many fields, is that practitioners should tune the average acceptance proba- 
bility to be close to 0.234. There are even adaptive algorithms which are automated to obtain this 
acceptance probability. The two preceding theorems demonstrate that, whilst this choice of accep- 
tance probability may have some benefits for RWM when applied to certain target measures, the 
practitioner would have far greater impact on complexity simply by adapting the proposal mean to 
obtain pCN, when sampling target measures of the form (1.1); furthermore, when doing so there 
is no universal optimal choice of acceptance probability. 

These considerations should convince the reader that the stated Optimal Design Principle 
is worth pursing further. Before proceeding to describe how we establish the theorems above, and 
describe the structure of the paper, we discuss the context for this design principle. We note first 
that, as stated, the principle is wide-ranging in applicability, but (deliberately) non-constructive. 
As mentioned above, here we demonstrate a particular application of the principle, with wide- 
ranging applicability to modifications of RWM proposals. The paper [CRSWll] shows how many 

4 



other new methods can be constructed by adhering to the design principle, and includes numerical 
demonstrations of their capabilities, and reference to other papers containing further numerical 
experiments. In particular, in addition to variants on standard RWM proposals, the paper shows 
how to modify Langevin, Hybrid (Hamiltonian) Monte-Carlo and Metropolis-within-Gibbs methods 
algorithms so as to adhere to the design principle in the context of the target measure (1.1). Finally 
we note that that the optimal design principle has been known to improve the efficiency in many 
finite dimensional contexts. In the specific context of determining diffusion constants for partially 
observed diffusions, the design principle is adopted in [RSOl] to break missing data/parameter 
dependencies and speed up algorithms. And in the context of spatial point processes, the principle 
is applied in [MW04]. The principle thus already has roots in the MCMC literature and the theory 
in this paper aims to both extend the understanding of this approach, and to provide theoretical 
tools to underpin it in the wide setting of Bayesian nonparametric problems. 

1.4. Diffusion Limits. We establish the two theorems from the previous subsection by use of 
diffusion limits for RWM and pCN. A central role is played by the equation 

(1.6) dz{t) = -h[z{t) + CV^(z)) dt + V2hdW 

where is a Wiener process on T-L with covariance operator C, and h is any positive scalar. This 
equation is vr— reversible and ergodic, satisfying a law of large numbers for sample path averages 
[HSV07b]. Given the discrete time Markov chain and the time increments tk = kS, we define its 
piecewise linear interpolation to be: 

z^(t) = ^(t-tfc)x'=+i'^ + ^(tfc+i-t)x*^'^ for tk<t<tk+i. (1.7) 

Our key tool in proving the theorems of the last subsection will be the fact that converges weakly 
to z solving (1.6). 

We pause to note that existence of a diffusion limit for the pCN algorithm is not entirely sur- 
prising. After all, most RWM algorithms evolve via local moves obtained by the discretization of 
an underlying diffusion and thus in the small noise limit will have a diffusive behavior. However 
it is note worthy that both RWM and pCN converge to the same diffusion, and that the relative 
efficiency of the two algorithms can be understood by examining the manner in which this limit is 
obtained. This is at the heart of demonstrating that the computational complexity of pCN is an 
order of magnitude smaller than that of RWM. And it is by virtue of conforming to the optimal 
design principle that we obtain this speed-up. The optimal design principle also applies to other 
algorithms, such as Hybrid Monte Carlo; but since such algorithms do not exhibit diffusive be- 
haviour, other tools will be needed to study the optimal design principle in this context; this point 
is address in the discussion in subsection 1.8. 

The idea of finding diffusion limits for MCMC methods was pioneered by Roberts and co-workers 
in [RGG97, RR98] for the RWM and MALA algorithms apphed to i.i.d. targets; see [RROl] for an 
overview. Because the target measure for this work is an independent product it is possible to find 
decoupled scalar diffusions in each coordinate. The recent paper [MPSll] shows that similar limit 
theorems can be obained for RWM applied to non-product measures of the form (1.1), by exploting 
the Gaussian prior structure. The limiting diffusion now couples all coordinates and is given by 
the stochastic partial differential equation (SPDE) (1.6). In this paper we demonstrate that such 
limit theorems may also be obtained for pCN when applied to (1.1). The difference between the 
complexity bounds contained in the two theorems from subsection 1.3 may now be understood 
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heuristically as follows. By the ergodic theorem, the SPDE (1.6) needs to reach time T = T{e) 
sufficiently large to obtain sample path averages which are 0{e) close to the ergodic average. The 
limit theorem for RWM requires that 5 = 0{N~^) and so 0{N) steps are required to reach time T. 
In contrast the limit theorem for pCN is valid on spaces of all dimensions for the same 6 sufficiently 
small; indeed the limit theorem holds on the infinite dimensional space %. Ks a. consequence the 
number of steps required to reach time T is 0(1) with respect to dimension for pCN. 

1.5. Insights from numerical analysis and optimal tuning of the proposal mean. An important 
point to be noticed is that the diffusion limits for RWM algorithms are constructed with a "discretize 
then design a sampler" viewpoint: firstly, the problem (target distribution) is discretized, secondly, 
a standard Random Walk Metropolis algorithm applied and then the algorithm is tuned to achieve 
optimality. The advantage is that the algorithms need not be defined on the entire Hilbert space 
but only on finite dimensional subspaces. In fact, the RWM is not well-defined on all of %. One of 
our key observations in this paper is that, this also turns out to be the key disadvantage of the RWM 
algorithm in high dimensions. In contrast, the pCN algorithms adhere to the "design a sampler 
then discretize" principle (see [HPUU08], Chapter 3, for discussion of similar issues for optimization 
problems on function spaces), where the algorithms are defined on all of and therefore, unlike 
RWM, do not suffer from the degeneracy due to discretization. 

To further illustrate our viewpoint on optimal proposal design, let us give some intuition see why 
o = (1 — 25)^/'^ is the right choice as suggested by the optimal design principle, behind the efficiency 
gain of pCN relative to all other proposals of the form (1.2). The key observation is that only the 
pCN proposal preserves the prior measure ttq. To see this note that the Ornstein-Uhlenbeck (OU) 
process 

dz = -zdt + V2dW (1.8) 
zo = X, 

is TTo— reversible and ergodic [DPZ96]; as for (1.6) here is a Brownian motion in T-L with covariance 
operator equal to C If t > then the exact solution of this equation has the form, for 6 = ^(1— e~^*), 

zit) = e-'x + ^((1-6-2*))^ 

= {l-26)^x + V26C, (1.9) 

where ^ ~ N(0, C). Thus the pCN proposal is an exact solution of this OU process. 

The consequence of this observation is as follows: if ^ = 0, the pCN algorithm started at 
stationarity has an acceptance probability of 1. For ^ 7^ the acceptance probability has the form 

a^{x,S,) = lAexp{^{x)-^{y)) (1.10) 

(see Section 3 for more details). This acceptance probability is well-defined on T-L and is 0(1) for 
any 6 G (0, ^). 

Now consider the proposal (1.2). It is straightforward to see that if x ~ ttq then y ~ -/V(0, (a^ + 26)C^ 
and hence the prior is invariant if and only if = 1 — 26. The acceptance probability now has the 
form 

a\x,C) = 1 Ae^p{la4x) - laM) (1.11) 
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where 

An important point to note here is that ||C~2x|| and ||C~2y|| are almost surely infinite with respect 
to ttq and hence vr. Hence, unless = 1 — 26, the exponent in the acceptance probability approaches 
infinity as the dimension N ^ oo, for fixed 6. To control this effect for 7^ 1 — 25 requires (5 — )• 
as — )• 00; this is at the heart of the theories of optimal proposal variance [RROl]. Since the 
number of steps to reach time T is inversely proportional to 6^^ in the diffusion limit scaling, this 
increases the computational complexity by a factor of 6^^{N) when compared with the algorithm 
for = 1 — 26 for which 6 may be chosen independently of A^. To be concrete we will confine 
attention in this paper to the cases a = 1, which is a standard RWM method, and = 1 — 26 
which gives the pCN method that we are advocating. However similar results could be derived for 
other values of a and would exhibit similar results to those for the RWM if a = 1 + 0{6). 

1.6. Simulated Annealing. We now describe a secondary motivation for study of the algorithms 
in this paper. There are many applications, including inverse problems in engineering and calculus 
of variations problems from materials science ( [EHN96] , [Dac89] ) where it is of natural interest to 
find global or local minima of a functional 

(1.12) J{x) = ^\\C-^/^xf + '^{x) , 

where C is the self-adjoint, positive and trace-class linear operator described above, on the Hilbert 
space (y., (•,•), II • 11^ . Gradient flow or steepest descent is a natural approach to this problem, but 
in its basic form requires computation of the gradient which, in some applications, may be an 
expensive or complex task. In addition, when mutiple minima are present, it may be important to 
include noise within the algorithm in order to allow escape from local minima. The second goal 
of this paper is to show how a noisy gradient descent can emerge, with out explicitly computing 
the gradient, from certain carefully specified random walks, when combined with a Metropolis- 
Hastings accept-reject mechanism [Tie98], with tunable noise level r. In the finite state [KJV83, 
Cer85] or finite dimensional context [Gem85, GH86, HKS89] the idea of using random walks, with 
accept-reject, to perform global optimization is a well-known idea which goes by the name of 
simulated-annealing; see the review [BT93] for further references. The novelty of our work is that 
the theory is developed on an infinite dimensional Hilbert space, and the applications of the theory 
are particularly tailored towards practical problems arising in Bayesian nonpar ametrics. In this 
context, minimization of J given by (1.12) corresponds to finding the MAP estimator. 

In finite dimensions the basic idea behind simulated annealing is built from Metropolis-Hastings 
methods which have an invariant measure with Lebesgue density proportional to exp(— J(x)) . 
By adapting the temperature r G (0, 00) according to an appropriate cooling schedule it is possible 
to locate global minima of J. The essential challenge in transfering this idea to infinite dimensions 
is that there is no Lebesgue measure. However, our key observation in this paper is that, the 
non-existence of the Lebesgue measure can be circumvented by working with measures defined via 
their density with respect to a Gaussian measure like the posterior measure tt defined in (1.1). To 
introduce the parameter r, we modify our prior distribution ttq and write 

7r5 = N(0,rC), (1.13) 
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so that the posterior distribution vr"^ takes the form: 



(x)ocexp —]■ (1-14) 



Note that if 'H. is finite dimensional then vr"^ has Lebesgue density proportional to exp(— r^^ J(x)) . 
It is also known that small ball probabilities are asymptotically maximized (in the small radius 
limit), under vr"^, on balls centred at minimizers of J [DSll]. To further incorporate the parameter 
T, we modify the pCN proposal to be 

y = (l-2(^)5x + \/2^^ (1.15) 

where ^ ~ N(0, C) and the "proposed move" (1-15) will be accepted or rejected with probability 
found from pointwise evaluation of ^' given by, 

a\x,i) = 1 Aexp(^-i(^'(y) - ^(x))) (1.16) 

(see Section 3 for more details) resulting in a Markov chain {x^'^}i^^i+ . 

A small generalization of the diffusion limit for pCN discussed in subsection 1.4 shows that, as 
6 goes to 0, the linear interpolation process z^{t) given by (1.7) now converges to the diffusion 

(1.17) dz{t) = - (^z{t) + CV^'(z)) dt + V2^dW 

where is a Wiener process on H with covariance operator C. This equation is vr'^— reversible, 
ergodic and satisfies a law of large numbers with respect to the measure [HSV07b]. Since small 
ball probabilities under vr'^ are maximized when centred at minimizers of J, the result thus shows 
that the algorithm will generate sequences which concentrate near minimizers of J. Varying r 
according to a cooling schedule then results in a simulated annealing method on Hilbert space. 
Weak convergence results for the approximation of stochastic equations in infinite dimensions may 
be found in the numerical analysis literature. For the heat equation and variants see [Sha03, DP09, 
GKL09, KLLIO], for dispersive and nondispersive wave problems see [HaulO, dBD06] and for delay 
equations see [BS05, BKMS08]. These papers rely on use of the Kolmogorov equation to establish 
weak convergence and do not typically deliver convergence on pathspace, but rather convergence 
of functionals at a given fixed time. In contrast our approach, which is much simpler, proves weak 
convergence on pathspace, and does not use the Kolmogorov equation; rather we use an invariance 
principle for Brownian motion in Hilbert space [Ber86], coupled with the preservation of weak 
convergence under continuous mappings. Our approach as it stands, does not deliver rates of weak 
convergence, but can be made more quantitative to obtain convergence rates. Since we are only 
interested in qualitative properties and their statistical applications, we do not venture in this 
direction. 

Let us give a quick heuristic to see why the gradient flow emerges through the pointwise com- 
putation of ^ and the accept-reject mechanism. Note that for 6 <^ 1, we see from (1.16) that 

(1.18) a^(x,0~lAexp 

This induces a bias towards accepting moves for which the the Gaussian random variable ^, which 
is independent of x, aligns with the negative gradient of ^. Formalizing this heuristic is the content 
of Section 5. 




Because the SDE (1.6) and the Markov chain are not started at stationarity, and because nei- 
ther possess a smoothing property (they are only asymptotically strong Feller [HSV07a]), almost 
sure fine scale properties under its' invariant measure vr"^ are not necessarily reflected at any fi- 
nite time. For example if C is the covariance operator of Brownian motion or Brownian bridge 
then the quadratic variation of draws from the invariant measure, an almost sure quantity, is not 
reproduced at any finite time in (1.6) unless 2;(0) has this quadratic variation; the almost sure 
property is approached only asymptotically as t — )■ oo. This behaviour is reflected in the underlying 
Metropolis-Hastings Markov chain pCN which approximates (1.6), where the almost sure property 
is only reached asymptotically as — )• oo. A third theorem that we highlight in this introduction 
(informally stated here and rigorously formulated and proved in Section 6) gives quantitative in- 
formation about the rate at which the pCN algorithm approaches statistical equilibrium. 

Theorem 3: Convergence of Almost Sure Quantities The almost sure quantities such as the 
quadratic variation under pCN satisfy a limiting linear ordinary differential equation (ODE) with 
globally exponentially attractive steady state given by the value of the quantity under ir'^ . 

We observe that the diffusion limit results obtained in this paper are entirely self-contained 
and the proof technique is analogous to the proof of diffusion limits of Markov chains in flnite 
dimensions. We believe therefore that the methods of analysis that we introduce may be used to 
understand other MCMC algorithms and other nonparametric problems. 

1.7. Structure of Paper. The paper is organzed as follows. In section 2 we describe some nota- 
tions used throughout the paper, discuss the required properties of Gaussian measures and Hilbert- 
space valued Brownian motions, and state our assumptions. In this section, we also exhibit a large 
class of statistical problems arising in practice which satisfy our assumptions. Section 3 contains a 
precise deflnition of the Markov chain {x^'^}i^^i+ : together with statement and proof of the weak 
convergence theorem to a diffusion that is the main theoretical result of the paper. Section 4 ex- 
plains the consequences of this theory for computational complexity, stating precisely, and proving, 
the theorems on Optimal Proposal Mean and Optimal Proposal Variance from subsection 
1.3. Section 5 contains proof of lemmas which underly the weak convergence theorem of section 3. 
In section 6 we state and prove the limit theorem for Convergence of Almost Sure Quantities 
such as quadratic variation; such results are often termed "fluid limits" in the applied probability 
literature. A simulation example illustrating the main result is presented in section 7. We conclude 
in section 8. Proofs of some technical lemmas are deferred to the Appendices A and B. 

1.8. Discussion and Outlook. The basic design principle which we highlight in this paper is 
explained with reference to the random walk Metropolis algorithm and analysis of the computational 
complexity is undertaken via the use of diffusion limits. It is important that the reader appreciate the 
following two points, (i) The basic design principle is applicable far beyond the specific context of the 
random walk Metropolis methods; the paper [CRSWll] demonstrates the breadth of applicability of 
the idea, and contains a range of computational experiments to substantiate the benefits of adopting 
the new algorithms, (ii) The basic design principle can be understood theoretically from many 
different perspectives, and the recently developed spectral gap analyses of pCN and RWM [HSVll], 
using the 1-Wasserstein distance, provides a depth of understanding that is likely to transfer to 
many other situations where diffusion limits do not necessarily arise naturally; in particular, unlike 
the spectral gap results arising from the seminal work of Meyn and Tweedie [MT93] , the work in 
[HSVll] is constructed to be well-adapted to infinite dimensional setting, and hence to analysis of 
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the basic design principle elucidated here. These links to recent emerging literature, and to other 
concluding observations, are made in section 8. 

2. Preliminaries. In this section we define some notational conventions, Gaussian measure 
and Brownian motion in Hilbert space, and state our assumptions concerning the operator C and 
the functional ^ . 

2.1. Notation. Let (•, •), || • ||^ denote a separable Hilbert space of real valued functions with 
the canonical norm derived from the inner-product. Let C be a positive, trace class operator on % 
and A|}j>i be the eigenfunctions and eigenvalues of C respectively, so that 

Cipj = Xj ipj for i G N. 

We assume a normalization under which {ipj}j>i forms a complete orthonormal basis in T-L. For 
every x G Ti we have the representation x = xjifj, where xj = {x, ipj). Using this notation, we 
define Sobolev-like spaces Ti^ , r E M, with the inner-products and norms defined by 

oo oo 

{x,y)r = ^f''xjyj and ||x||^ = ^ j^r ^^2.1) 
i=i i=i 

Notice that T-L^ = T-L. Furthermore T-L^ C H C 'H~^ for any r > 0. The Hilbert-Schmidt norm || • ||c 
is defined as 

II Il2 \-2 2 

j 

For r £ M, let Br : Ti Ti denote the operator which is diagonal in the basis {vj}j>i with diagonal 
entries j^'', i.e., 

D •2r 

Br (fj = 3 (fj 

1 

so that Br fj = j^^Pj- The operator Br lets us alternate between the Hilbert space H and the 
interpolation spaces via the identities: 

{x,y)r = {Bjx,Bjy) and \\x\\l = \\B?x\\^. (2.2) 

— 1 /2 —1/2 

Since \\Br = Hv'fcll = 1, we deduce that {Br fk}k>i forms an orthonormal basis for l-T . 

For a positive, self-adjoint operator D : W i— ?■ W, its trace in is defined as 

oo 

Tracenr{D) = ^{Br ^ipj,DBr Vj)r- 
j=i 

Since Trace-^r(D) does not depend on the orthonormal basis, the operator D is said to be trace 
class in T-L^' if Trace-^j (D) < oo for some, and hence any, orthonormal basis of H^'. Let (Si-^j- denote 
the outer product operator in V.^ defined by 

{x0n^ y)z = {y,z)rX Vx,y,ze'W. (2.3) 
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For an operator L : Ti^ i— )• Ti , we denote its operator norm by || • \\ defined by 

imiciH-m = sup \\Lx\\i. 

\\x\\r = l 

For self-adjoint L and r = I = tliis is, of course, the spectral radius of L. 
Throughout we use the following notation. 

• Two sequences {a„}„>o and {/3n}n>o satisfy a„ < /3„ if there exists a constant K > 
satisfying a„ < K(3n for all n > 0. The notations a„ x /3„ means that a„ < /?„ and /?„ < an- 

• Two sequences of real functions {fn}n>o and {5n}n>o defined on the same set satisfy 
fn ^ 9n if there exists a constant K > satisfying < Kg^ix) for all n > and all 
X G J7. The notations /„ x (^n means that /„ < (7n and gn ^ /n- 

• The notation E^^ [fix, ^)] denotes expectation with variable x fixed, while the randomness 
present in ^ is averaged out. 

2.2. Gaussian Measure on Hilbert Space. The following facts concerning Gaussian measure on 
Hilbert space, and Brownian motion in Hilbert space, may be found in [DPZ92]. Since C is self- 
adjoint, positive and trace-class we may associate with it a centred Gaussian measure ttq on T-L with 
covariance operator C, i.e., ttq == N(0, C). If x ~ ttq then we may write (Karhunen-Loeve) 

oo 

X = '^Xj Pj ipj with Pj ~ N(0, l)i.i.d. (2.4) 
i=i 

If we let {fl,T,F) denote the probability space for the iid sequence {^j}j>i then, since C is trace- 
class, the sum converges in L?'{Q.;'H). 

— 1/2 

Since {Br Vj}j>i forms an orthonormal basis for T-T , we may write (2.4) as 

oo 

^ = E(^J -?")/'i(^-"'^''^j) ^^^^ p,Sn(0,1) i.i.d. (2.5) 

i=i 

Define 

Cr = B}/'^ C Bll^ . (2.6) 

If Trace-^r(Cr.) = X^^i j"^^ < c« for some r > then the sum (2.5), and hence the sum (2.4), 
converges in L?'{Q.;'hJ') to a centred Gaussian random variable x with covariance operator Cr in 

W , and C in T-L. Then for any u,v ^ W, if x ~ ttq, we have K[{x,u)r{x,v)r] = {u,Crv)r. Thus 
in what follows, we freely alternate between the Gaussian measures N(0, C) on H and N(0, Cr) on 
We note that 

oo 

E[||x||2] = ^Xjf' = Trace«.(a). 
i=i 

Frequently in applications the functional ^ arising in (1.12) may not be defined on all of Ti, 
but only on a subspace "H* C Ti, for some exponent s > 0. Even though the Gaussian measure ttq 
is defined on Ti, depending on the decay of the eigenvalues of C, there exists an entire range of 
values of r such that Trace-^r-(Cr) < oo: in that case the measure ttq has full support on Ti^, i.e., 
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TToiJ-r) = 1. Indeed, the condition Tiacey^riCr) < oo also implies that for any r > the measure 
ttq has full support on Ti^ . From now onwards we fix a distinguished exponent s > and assume 
that : ^ M and that Tracers (C^) < oo. Then ttq = N(0, C) on H and Tr{W) = 1. For ease of 
notations we introduce 

_ 1 

ipj = Bs fj for J > 1- 

The family {ipj}j>i forms an orthonormal basis for (Ti^, {■, ■)s) ■ We may view the Gaussian measure 
ttq = N(0, C) on (•, •)) as a Gaussian measure N(0, Cg) on [H^, (•, •)s) • 

A Brownian motion {W{t)}t>o in Ti'^ with covariance operator Cg ■ Ti^ — ?• is a continuous 
Gaussian process with stationary increments satisfying K[{W{t),x)s{W{t),y)s] = t{x,Csy)s- For 
example, taking {/3j(t)}j>i independent standard real Brownian motions, the process 

w{t) = Y,{f^j)m^j (2-7) 

defines a Brownian motion in T-L^ with covariance operator Cg', equivalently, this same process 
{W{t)}t>o can be described as a Brownian motion in T-L with covariance operator equal to C since 
Equation (2.7) may also be expressed as W{t) = YlJLi 

2.3. Assumptions. In this section we describe the assumptions on the covariance operator C 
of the Gaussian measure ttq = N(0, C) and the functional ^. For each x E T-L^ the derivative 
V^(x) is an element of the dual (Ti^)* of Ti^, comprising the linear functionals on Ti'^. However, 
we may identify {'H'^)* = H."^ and view V^'(3;) as an element of for each x G Ti^ . With this 
identification, the following identity holds 

\\v'i>{x)\\c(w^m = liv^'(x)||_, 

and the second derivative 5^^'(x) can be identified with an element of C{T-L^ ,'H~^). To avoid tech- 
nicalities we assume that ^{x) is quadratically bounded, with first derivative linearly bounded and 
second derivative globally bounded. Weaker assumptions could be dealt with by use of stopping 
time arguments. 

Assumptions 2.1. The functional ^ and the covariance operator C satisfy the following as- 
sumptions. 

Al. Decay of Eigenvalues A| of C: there exists a constant k > ^ such that 

(2.8) 

A2. Domain of there exists an exponent s G [0,/« — 1/2) such ^ is defined on T-L^. 
A3. Size of ^: the functional ^ : Ti'^ — )• M satisfies the growth conditions 

< ^(x) < l + \\x\\l 

A4. Derivatives of ^: The derivatives of ^ satisfy 

||V^'(x)||_, < l + ||x|U and \\d''^{x)\\cin^n-^) ^ 1- 

Remark 2.2. The condition k > ^ ensures that Trace-^r(Cr) < 00 for any r < k — ^: this 
implies that 'Kq{'W) = 1 for any r > and r < k — ^. 
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Remark 2.3. The functional ^( defined on Ti^ and its derivative at x G H'^ is 

given by V'^{x) = J2j>oj^^^j^j ^ '^'''^^ = \\^\\s- The second derivative d^"^{x) G 

C{l-L^ ^T-L'^) is the linear operator that maps u G Ti.^ to Yl,j>Qj'^^{'^^'Pj)'fj ^ its norm satisfies 

= 1 for any x G W . 

The Assumptions 2.1 ensure that the functional ^' behaves well in a sense made precise in the 
following lemma. 

Lemma 2.4. Let Assumptions 2.1 hold. 

1. The function d{x) = — + CV^(a;)^ is globally Lipschitz on T-L^ : 

\\d{x)-d{y)\\s < \\x-y\\s Vx,yG?^^ (2.9) 

2. The second order remainder term in the Taylor expansion of ^ satisfies 

\^{y)-^{x)- {V^[x),y-x)\<\\y-x\\l \fx,yen'. (2.10) 

Proof. See [MPSll]. □ 

2.4. Statistical Models Satisfying Our Assumptions.. We demonstrate two classes of models 
which satisfy our assumptions. We emphasize, however, that our theoretical development has not 
been optmizied with respect to the assumptions made. We could weaken the assumptions at the 
price of considerably more involved proofs. In broad terms we expect the ideas in this paper to 
apply whenever interest is focussed on sampling measures which possess a well-behaved density 
with respect to a Gaussian random field prior. 

2.4.1. Nonparametric Regression.. Consider estimating an unknown function x observed at dis- 
crete points on a compact set: 

yk = x{tk) + ek (2.11) 

where G 7" C M*^ and are i.i.d errors, which are mean and have a smooth density g{-). The 
unknown function is assumed to be in the Hilbert space Ti = L2(T), i.e., x G L2{T). We assume a 
centered Gaussian random field prior on the function x, ttq = N(0,C) satisfying (2.8). Clearly, the 
(negative) log-likelihood is given by 

^(^) = - X]^°S5(yfc - x{tk)) . 

k 

By ensuring enough decay of eigevalues, there are a large class of covariance operators C so that 
item 1 of Assumptions 2.1 is satisfied. Indeed, since g{-) has Gaussian or heavier tails, the function 
^ grows at most quadratically, satisfying item 2 of Assumptions 2.1. By virtue of g{-) belonging 
to an exponential family, the functional ^ is smooth. Furthermore imposing the condition on the 
growth of the function x at infinity (which is related to the eigenvalues of the prior covariance C), 
the required bounds on the derivatives of ^ can be obtained. 

As mentioned above the theoretical properties of these regressions models are very well studied. 
In fact sharp rates of posterior concentration are known, and remarkable connections are established 
between the convergence rates of the posterior distribution and the small ball probabilities of the 
prior distribution [VDVVZ08, CasOS] which in turn is related to the eigenvalues of the covariance 
operator. Interesting results are also known from a minimax perpsective [ZhaOO]. 
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2.4.2. Statistical Inference for Diffusions. Consider the following stochastic differential equa- 
tion: 

dXt = {AXt - BB'VV{Xt, 9)) dt + B dWt (2.12) 

where ^ is a drift function, A, B £ M^^"^ and 9 gR'^ is an unknown parameter. The statistical goal 
is to estimate 9 from the discrete observations Xt,, = x^, tk G [0, T]. SDE models of the above kind 
have received tremendous attention in recent years (see [BRSV08] and the references therein). 

From a Bayesian point of view, a natural way to proceed (after carefully choosing the prior 
distribution for 9) is to obtain posterior samples via data augmented Gibbs sampling. This will 
involve two steps: sampling the conditional distribution of 9 given the augmented (full) path Xt, t E 
[0,T] and sampling the conditional distribution of the augmented path Xt,t G [0,T] given 9, 
satisfying the constraint Xt,. = Xk- Sampling conditional diffusions (i.e., diffusions Xt conditioned 
to satisfy a few values Xt^, = x^) is a very challenging problem in general since their dynamics are 
quite intractable. 

However for SDEs given in (2.12), there is a simple and efficient way to proceed. Let vr denote 
the law of the diffusion bridge Xt conditioned to have Xq = xq and Xi = xi (more points can be 
easily dealt with using the Markovian property of the diffusion Xt). Notice that vr is a probability 
measure on ^ = L2[[0, 1], R"^]. If F = 0, then we obtain 

dXt = AXt dt + B dWt, Xq = xo,Xi = xi (2.13) 

which is a Gaussian process whose dynamics is more tractable. Let vro denote the law of the diffusion 
given in (2.13). Then it is known that ([BRSV08]) vr is absolutely continuous with respect to vro on 
n = L2[[0,l],M'^] with 

^iX) = exp{-^iX)} (2.14) 
dvro 

^(X) = (1, ^liX)), ^i(x) = \B-'f{x)\^ + ^div/(x) + fixYiBB'y'Ax 

f{x) = -BB'VV{Xt,e) . 

Thus from (2.14) we see that the target measure vr satisfies our formulation. Moreover, under mild 
regularity conditions, it can be shown that the functional ^ in (2.14) satisfies Assumptions 2.1 
[BRSV08]. 

There are many more practical applications including image processing, non-linear function esti- 
mation from partial differential equations which model physical phenomena and signal processing, 
where the target distributions satisfy our formulation and assumptions, see [CRSWll] for a detailed 
account. 

3. Diffusion Theorem. This section contains a precise statement of the algorithm, statement 
of the main theorem showing that piecewise linear interpolant of the output of the algorithm 
converges weakly to a noisy gradient flow, and proof of the main theorem. The proof of various 
technical lemmas is defered to section 5. 

3.1. pCN Algorithm. We now define the Markov chain in which is reversible with respect 
to the measure vr"^ given by Equation (1.14). This is the pCN algorithm with proposal (1.4). Let 
X G W be the current position of the Markov chain. The proposal candidate y is given by (1.9), so 
that 

y= (l-2(5)^x + \/2^C where ^ = N(0,C) (3.1) 
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and S G (0, ^) is a small parameter which we will send to zero in order to obtain the noisy gra- 
dient flow. In Equation (3.1), the random variable ^ is chosen independent of x. As described in 
[BRSV08] (see also [CDSll, StulO]), at temperature r G (0, c«) the Metropolis-Hastings acceptance 
probability for the proposal y is given by 

a^{x, = 1 A exp(-i (^(y) - ^(x))) . (3.2) 

The chain is then reversible with respect to vr"^. The Markov chain x^ = {x^'^}k>o can be written 
as 



^k,5yk,s ^ _ ^k,s^ ^k,5 ^j^g^g yk,5 ^ _ 2 + V257e^ (3.3) 

Here the are iid Gaussian random variables N(0, C) and the 7'^''^ are Bernoulli random variables 
which account for the accept-reject mechanism of the Metropolis-Hastings algorithm, 

^M4£f^5(^M^^fc) S BernouUifa^(a;'='^C*^)). (3.4) 



The function 7'^(x,^) can be expressed as J^{x,^) = '^{u<a^(x,()} where U ~ Uniform(0, 1) is 
independent from any other source of randomness. The next lemma will be repeatedly used in the 
sequel: it states that the size of the jump y — x is of order 

Lemma 3.1. Under Assumptions 2.1 and for any integer p > 1 the following inequality 
M\\V-^\\V\' < SMs + V6 < V6{l + \\x\\s) 

holds for any 5 G (0, 

Proof. The definition of the proposal (3.1) shows that \\y - < 6^ \\x\\^ + 52E[||^||f]. 
Fernique's theorem [DPZ92] shows that ^ has exponential moments and therefore E[||^||f] < 00. 
This gives the conclusion. □ 

For future use, we define the local mean acceptance probability at the current position x via the 
formula 

a\x) = E^. a\x,0 . (3.5) 

3.2. Diffusion Limit Theorem. Fix a time horizon T > and a temperature r G (0,oo). The 
piecewise linear interpolant of the Markov chain (3.3) is defined by Equation (1.7). The following 
is the main result of this article. Note that "weakly" refers to weak convergence of probability 
measures. 

Theorem 3.2. Let Assumptions 2.1 hold. Let the Markov chain x^ start at fixed position G 
"H*. Then the sequence of processes converges weakly to z in C([0, T]; ?^**), as 5 — )■ 0, where z 
solves the T-L^ -valued stochastic differential equation 



dz = -(z + CV^{z)]dt + V2TdW (3.6) 

Zq = X* 



and W is a Brownian motion in Ti^ with covariance operator equal to C, 
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For conceptual clarity, we derive Theorem 3.2 as a consequence of the general diffusion approxi- 
mation Lemma 3.5. Consider a separable Hilbert space [W, (•,-)s) and a sequence of ^**-valued 
Markov chains = {x'''^}k>o- The martingale-drift decomposition with time discretization 6 of 
the Markov chain reads 

= x^^^ + d\x^'^) 6 + V2^ T^{x^'\C^) 
where the approximate drift and volatility term T^{x,S,'') are given by 

d^{x) = (5-1 E[x''+^'^ - x'''^ |x^'^ = x] (3.8) 
r\x, = (2r5)-i/2 (^^k+i,s _ ^k,5 _ jg _ ^k,5 ^ ^] y 

Notice that [V'"'^]^^^, with T^'^ = T^x^'^Sj'), is a martingale difference array in the sense that 
is a martingale adapted to the natural filtration = {-F^''^}fc>o of the Markov 
chain x^ . The parameter 6 represents a time increment. We define the piecewise linear rescaled 
noise process by 



k 



W\t) = V5 V r^'^ + r'^+i'^ for tk<t< tk+i. (3.9) 

,=0 V5 

We now show that, as 5 — )• 0, if the sequence of approximate drift functions d^{-) converges in the 
appropriate norm to a limiting drift d{-) and the sequence of rescaled noise process converges to 
a Brownian motion then the sequence of piecewise linear interpolants defined by Equation (1.7) 
converges weakly to a diffusion process in Ti'^. In order to state the general diffusion approximation 
Lemma 3.5, we introduce the following: 

Conditions 3.3. There exists an integer p > 1 such that the sequence of Markov chains x^ = 
{x'''^}k>o satisfies 

1. Convergence of the drift: there exists a glohaly Lip schitz function d : — )■ such that 

\\d'{x)-d{x)\\s < 6-{l + \\xE) (3.10) 

2. Invariance principle: as 6 tends to zero the sequence of processes {W^}^^^^ 1^ defined by 
Equation (3.9) converges weakly in C{[0,T],'H'^) to a Brownian motion W in T-L'^ with covari- 
ance operator Cs ■ 

3. A priori bound: the following hound holds 

sup \\x^'^\\i\ I < 00. (3.11) 

Remark 3.4. The a-priori bound (3.11) can equivalently be stated as sup^^f^^ |e Jp'" ||z^(ti)||f dn 
00. 
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It is now proved that Conditions 3.3 are sufficient to obtain a diffusion approximation for the 
sequence of rescaled processes defined by equation (1.7), as b tends to zero. 

Lemma 3.5. (General Diffusion Approximation for Markov chains) 

Consider a separable Hilbert space (Ti^ ,{■,■) s) and a sequence ofW -valued Markov chains = 
{x'''^}k>o starting at a fixed position x=k G H'^ , 

= X, V<5 e (0;1). 

Suppose that the drift-martingale decompositions (3.7) of x^ satisfy Conditions 3.3. Then the se- 
quence of rescaled interpolants S C([0,r],^*) defined by equation (1.7) converges weakly in 
C{[0,T],T-L^) to z £ CdOjTjjT-L^) given by the stochastic differential equation 



dz = d{z) dt + V2TdW (3.12) 

Zo = X* 

where W is a Brownian motion in %" with covariance Cs- 

Proof. For the sake of clarity, the proof of Lemma 3.5 is divided into several steps. 

• Integral equation representation. 

Notice that solutions of the T^'^-valued SDE (3.12) are nothing else than solutions of the 
following integral equation, 

z{t)=x^+[ d{z{u))du + V^rW{t) VtG(0,r), (3.13) 
Jo 

where is a Brownian motion in T-L^ with covariance operator equal to Cg- We thus introduce 
the Ito map 9 : C{[0,T],'H') C([0,r],^^) that sends a function W G C([0,T],?^*) to the 
unique solution of the integral equation (3.13): solution of (3.12) can be represented as 0(1^) 
where W is an ''-valued Brownian motion with covariance Cg. As is described below, the 
function Q is continuous if C([0, T], T^'') is topologized by the uniform norm ||u'||(7([o,t],'H=) = 
sup{||tt;(t) lis : t S (0,T)}. It is crucial to notice that the rescaled process z^ , defined in 
Equation (1.7), satisfies 



1 



z' = @{W') where W' {t) := W%t) + ^ j [d%z%u)) - diz\u))]du. (3.14) 

v2r 







In Equation (3.14), the quantity d^ is the approximate drift defined in Equation (3.8) and z^ 
is the rescaled piecewise constant interpolant of {x^'^}k>o defined as 

z^{t) = x^'^ for tk<t<tk+i. (3.15) 

The proof follows from a continuous mapping argument (see below) once it is proven that 

converges weakly in C{[0,T],'H') to W. 
The Ito map 6 is continuous 

It can be proved that © is continuous as a mapping from ^C([0, T], 7^*), || • ||c{[o,t],'H'')) to 
itself. The usual Picard's iteration proof of the Cauchy-Lipschitz theorem of ODEs may be 
employed: see [MPS 11]. 
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The sequence of processes converges weakly to W 

The process W^{t) is defined by W^{t) = W^{t) + ^ J^[d\z^{u)) - d{z^{u))] du and Con- 
ditions 3.3 state that converges weakly to W in C([0, T], 7^*). Consequently, to prove 
that W^t) converges weakly to W in C([0, T], ?^*), it suffices to verify that the sequences of 
processes 



(w, t)^ [d\z\u)) - d{z\u))\ du 



(3.16) 



converges to in probability with respect to the supremum norm in C([0, T],l-L'^). By Markov's 
inequality, it is enough to check that 



hm E 

5-s.O 



d^{z\u))-d{z\u))\\s du 



0. 



Conditions 3.3 states that there exists an integer p > 1 such that — < 5-(l + ||x| 

so that for any < u < t/c+i we have 



d\-z\u)) - d{-z\u)) <6{l + \\-z\uW;) =5{l + 



„k,5 up 



(3.17) 



Conditions 3.3 states that d{-) is globally Lipschitz on T-L^ . Therefore, Lemma 3.1 shows that 



E\\d{z\u)) - d{z\u) 



< El 



X 



k,S\ 



(3.18) 



From estimates (3.17) and (3.18) it follows that \\d\z\u)) - d{z\u))\\s < 6^1 + ||x'=''^||f) 
Consequently 



E 



d^{z\u))-d{z'{u))\\s du 



< 6^ 



1 + \\x 



k,5 up 



kS<T 



(3.19) 



The a- priori bound of Conditions 3.3 shows that this last quantity converges to as (5 converges 
to zero, which finishes the proof of Equation (3.16). This concludes the proof of W^{t) =^ W. 
Continuous mapping argument. 

It has been proved that is continuous as a mapping from ^C([0, T], ^*), || • ||c([o,t],'H^)) 
to itself. The solutions of the "^^^-valued SDE (3.12) can be expressed as 0(VF) while the 
rescaled continuous interpolate also reads z^ = Q{W^). Since converges weakly in 
(^C{[0,T],'H^), II • ||c{[o,T],'H=)) to as 5 tends to 0, the continuous mapping theorem ensures 

that z^ converges weakly in (^C{[0,T],'H^), \\ ■ ||c([o,t],'H=)) to the solution Q{W) of the Ti^- 
valued SDE (3.12). This ends the proof of Lemma 3.5. 

□ 



In order to establish Theorem 3.2 as a consequence of the general diffusion approximation Lemma 
3.5, it suffices to verify that if Assumptions 2.1 hold then Conditions 3.3 are satisfied by the Markov 
chain defined in section 3.1. In section 5.2 we prove the following quantitative version of the 
approximation d^ ^ d where d{x) = — ( x + CV^(x) ) : 
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Lemma 3.6. (Drift estimate) 

Let Assumptions 2.1 hold and let p > 1 be an integer. Then the following estimate is satisfied, 

\\d'{x)-d{x)E<5Hl + \\x\\f). (3.20) 

Moreover, the approximate drift d^ is linearly hounded in the sense that 

\\d\x)\\s < l + \\x\\s. (3.21) 

It follows from Lemma (3.6) that Equation (3.10) of Conditions 3.3 is satisfied as soon as Assump- 
tions 2.1 hold. The invariance principle of Conditions 3.3 follows from the next lemma. It is proved 
in section 5.5. 

Lemma 3.7. (Invariance Principle) 

Let Assumptions 2.1 hold. Then the rescaled noise process W^{t) defined in equation (3.9) satisfies 

^ W 

where =^ denotes weak convergence in C([0, T];^*), and W is a T-L^ -valued Brownian motion with 
covariance operator Cg. 

In section 5.4 it is proved that the following a priori bound is satisfied, 

Lemma 3.8. (A priori bound) 

Consider a fixed time horizon T > and an integer p > 1. Under Assumptions 2.1 the following 
bound holds, 



{5-e[Y^ \\x'''\\p\ :<5e(0,^)} < oo. (3.22) 



sup 

k5<T 

In conclusion, Lemmas 3.6 and 3.7 and 3.8 together show that Conditions 3.3 are consequences of 
Assumptions 2.1. Therefore, under Assumptions 2.1, the general diffusion approximation Lemma 
3.5 can be applied: this concludes the proof of Theorem 3.2. 

4. Implications for Computational Complexity. In this section we state and prove precise 
versions of Theorems 1 and 2 from the subsection 1.3 in the introduction. Recall that the target 
measure vr is defined by (1.1): 

djr 

— (x) = Mq, exp(-^(x)). 
otvro 

and our goal is to quantify the computational complexity of the RWM and pCN algorithms designed 
to approximate expectations of the form J fdir for suitable test functions / : "H M. In practice a 
natural way to discretize the RWM and pCN algorithms is to sample the target measure vr^, the 
measure obtained by projecting vr onto the first N eigenfunctions of C. To define this recall that 

Cipj = X]ipj, j G N. 

Define by the orthogonal projection in T-L onto the span of {(pj}jLi and ttq = N{0, CP^). 
Then set 

(x) = M^^N exp(-^(P^x)) . 
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The resulting Markov chain to sample vr^ on % may be implemented on 



We initialize at 



„o,<5 



x^'" = /^"'x*, X* G T-L. We let denote expectation with respect to tt^ for fixed starting point 
x^'^ in the Markov chain, and let E'^ '* denote expectation with respect to vr'^ for x=f ~ vr or 
equivalently x^'^ ~ tt^. 

We may now prove Theorem 1, concerning the pCN method. This follows from an application 
of Theorem 3.2 on P^T-L with r = 1. Recall that C{N) is the cost of implementing a single 
proposal of RWM, including evaluation of the acceptance probability; and that (1.5) defines fN,a,5 
the estimator of the expectation of / using the Markov chain pCN or RWM for a = (1 — 2(5)^/^ and 
a = 1 respectively. 

Theorem 4.1. Let Assumptions 2.1 hold and let f : Ti.^ — )■ M 6e hounded with hounded Frechet 
derivative. Let a = {1 — 26)^^'^. Then for any fixed x=k in the support of it and any e > 0, there is 
X G N and (5c G (0, both independent of N and Nc €z ^ such that, for S < 5c and N > Nc 

|lE"'^/7V,a,5-IEV| <e. 
Thus the computational cost grows to achieve this error grows with N as C{N). 

Proof. We first observe that the limit Theorem 3.2 holds on P^T-L and the corresponding 
hmiting SPDE is given by (1.17) with ^' = ^{P^-) and C ^ = P^CP^. This fohows from 
inspection of the proof, using the fact that the theorem is proved on the infinite dimensional space % 
and the observation that the functional ^{P^ ■) satisfies all of the Assumptions 2.1 with constants 
independent of A^. We denote the limiting solution by z^^\ 

We then use the following decomposition of the error: 



fN,a,S-^^f\ < |IE" /jV,a,5-IE" 



< 



N 



N,a,5 



/| + |E- / 

T 



IE'' / 





E''/|. 



E^" f{z^^>{s))ds 







+ 







T 



E-"f{z''{s))ds 



N 

E'^ f 



First choose sufficiently large so that the third item is less than e/3. The existence of such an 
A^ follows from Theorem 4.6 of [StulO] and the Lipschitz continuity of /. The second item may 
be made less than e/3 by the properties of the SPDE (1.17) and the ergodic Theorem 4.11 from 
[HSV07b], again for some T = T[e) indpendent of A^. Without loss of generality we may choose T 
so that it is an integer mutiple of 5. With this choice we consider the first item on the right-hand 
side and note that it may be written as 



N 

IE'' /Af,a,<5 



1 

T 



E-"/(z(^)(s))ds 



< 



T 



T 



+ 



E-'^/(#) -E-'^/(zW(s))ds) 
E-"/(z-^)-E-"/(/(s)))d. 
ij[^(E-"/(z^)-E''"/(. 



z^''\s)))ds 



Now the first integral tends to zero by application of the estimate (3.18) with / replacing d, noting 
that / is Lipschitz; and the second integral tends to zero by the weak convergence given in Theorem 
3.2 on P^T-L with r = 1. Both these require choosing 5 sufficiently small, independently of A^; it 
is then possible to ensure that the third term in the preceding bound is also less than e/3 and the 
theorem is proved. □ 
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To prove Theorem 2 we use the following theorem from [MPSll]: 



Theorem 4.2. Let {x^'^^ he the Markov chain on corresponding to the RWM proposal 
applied to sample vr^ and he as defined in (1.7). Then the following hold: 

1. If the chain is started in stationarity for the "optimal scale" 5 = £/N then the average accep- 
tance prohability converges to a numher a{t) G (0, 1) and the linear interpolation process 
converges weakly to the diffusion (1.6) on the space of paths, C([0,T],7^) with z(0) = x and 
W a Wiener process on % with covariance operator C, for some h{£) S (0,oo). 

2. The speed factor h{l) is maximized over the parameter i if the average acceptance probahility 
is chosen to he 0.234 to three decimal places. 

Now we precisely state and prove our claim about the complexity of the RWM: 

Theorem 4.3. Let Assumptions 2.1 hold and let f : Ti ^ M he hounded with hounded Frechet 
derivative. Set a = 1 and 5 = i/N where i is the 'optimal value' from Theorem 4-2. Then for 
xq ~ vr^, any e > and K = KqN there is Kq € N independent of N and £ N such that, for 
N>Nc 

\E^''fN,a,s-^''f\ <e. 
Thus the computational cost grows to achieve this error grows with N as N x C{N). 

Proof. The proof is similar to that of Theorem 4.1, but using the limit Theorem 4.2 in place of 
Theorem 3.2. As a consequence we have 6 = i/N and, since K = T/5 we obtain K = \TN/l\ . □ 

Thus we see that, even if one tunes the variance of RWM 'optimally' (in the sense of optimal 
scaling) the computational gain is negligible when compared to the gain obtained by tuning the 
mean to obtain the pCN algorithm. This reenforces the take home message of the paper, namely 
that for nonparametric inference, the design of tailored proposals, which take into account infinite 
dimensional structure, can have significant impact on computational complexity. 

5. Key Estimates. This section assembles various results which are used in the previous 
section. Some of the technical proofs are deferred to the appendix. 

5.1. Acceptance Prohahility Asymptotics. This section describes a first order expansion of the 
acceptance probability. The approximation 

a\x,i)^a^{x,i) where a^(x, ?) = 1 - ^ — (V^'(x), 0]I{(v*(:.),0>o} (5-1) 

is valid for 5 <^\. The quantity has the advantage over of being very simple to analyse: explicit 
computations are available. This will be exploited in section 5.2. The quality of the approximation 
(5.1) is rigorously quantified in the next lemma. 

Lemma 5.1. (Acceptance probability estimate) 

Let Assumptions 2.1 hold. For any integer p > 1 the quantity a^{x,^) satisfies 

E4\a'{x,0-aHx,0n < 6P{l + Mln- (5.2) 
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Proof. See Appendix A. □ 

Recall the local mean acceptance probability defined by a^{x) = Ex[a^{x^Sy\ in Equation (3.5). 
Define the approximate local mean acceptance probability by a^{x) == Ea;[Q;'^(x, ^)]. We now use 
Lemma 5.1 to approximate the local mean acceptance probability a^{x). 

Corollary 5.2. Let Assumptions 2.1 hold. For any integer p > 1 the following estimates hold, 



\aHx)-aHx) 



Proof. See Appendix A. 



(5.3) 
(5.4) 

□ 



5.2. Drift Estimates. Then next lemma shows that explicit computations are available for the 
quantity . We will use these explicit computations, together with quantification of the error 
committed in replacing by , to estimate the mean drift (in this section) and the diffusion term 
(in the next section). 

Lemma 5.3. The approximate acceptance probability q;'^(x,^) satisfies 

(2t 



Ex 



a\x,i)-i 



-CV^ix) 



Proof. Let u 
V G T-L^"^ we have 



2t 

5 '^^ 



E, 



' i £ To prove the lemma it suffices to verify that for all 



-{CV^{x),v) 



To this end, use the decomposition v = aV^'(x) + w where a G M and w ^ % ^ satisfies 



V 



(CV^'(x), w) = 0. Since ^ ~ N(0, C) the two Gaussian random variables 



= {V^{x),i) and 



are independent: indeed, {Zx^, Z^) is a Gaussian vector in with Cov(Z^, Z^) = 0. It thus follows 
that 

{u,v) = -2 (E4(V^'(x),01{(vvi>{x),o>o} -^l , aV^{x) + w) 
= -2E-r aZll^z^>o} + ^u. ^*l{Zg,>o} 

= -2a E^ Z|,l|2^>o} = -OiErc Z| 
= -a((7V^(x),V^(x)) = (-CV*(x),aV^(x) + 
= -{CV^{x),v), 

which concludes the proof of Lemma 5.3. □ 
We now use this explicit computation to give a proof of the drift estimate Lemma 3.6. 
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Proof of Lemma 3.6. The function defined by Equation (3.8) can also be expressed as 



\x) = { 



{1-25)2-1 s 
^ ct 



(5.5) 



where the mean local acceptance probability a^{x) has been defined in Equation (3.5) and the two 
terms Bi and B2 are studied below. To prove Equation (3.20), it suffices to establish that 



\Bi + x\\P< 5-2(1 + \\x\\f) and \\B2 + CV^{x)\\p < 5-2 (1 + \\x\\f). 



We now establish these two bounds. 

• Lemma 5.1 and Corollary 5.2 show that 



\Bi+x\ 



{1-25)2-1 s 
; a 



{x)+iy \\xE 

.l\P+\a\x)-l\''} 
<{5P + 5Hl + \\x\\P)] \\x\\P < 5Hl + 



< 



{I 



^1-2,5)5-1 



• Lemma 5.1 shows that 



(5.6) 



(5.7) 



\B2 + CV^{x)E 



■E,[a\x,0^] + CV^{x) 



(5.8) 



< 5-mE,[{a'{x, - a'{x, 0} II! + II \/ y IE.[a'(x, 6 + CV^{x) 



By Lemma 5.3, the second term on the right hand is equal to zero. Consequently, Cauchy 
Schwarz' inequality implies that 



\B2 + CV*(x)||f < 5-m,[\a\x,i) - a'{x,0 



2i E 



<5-H5'{l + \\x\\t)Y < 5Hl + \\x 



Estimates (5.7) and (5.8) give Equation (5.6). To complete the proof we establish the bound (3.21). 
The expression (5.5) shows that it suffices to verify 

E,[a^{x,OC] < l + Ms- 
To this end, we use Lemma 5.3 and Corollary 5.2. By Cauchy-Schwarz, 



IE. 



a'{x,()-^ 



K(x,o-i]-e 



< 5—2 E,^ 



(a^(x,0-l) 



< 1 + 



X 



which concludes the proof of Lemma 3.6. 



□ 
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5.3. Noise Estimates. In this section we estimate the error in the approximation r'^'^ ~ N(0, Cg)- 
To this end, let us introduce the covariance operator D^{x) of the martingale difference , 

D\x) =e\t''^^ T^'^ \x^'^ = X 

For any x,u,v G Ti^ the operator D^{x) satisfies 

W.Ut^'^,u)s{T^'\v)s\x^'^ = x] = {u,D^{x)v)s. 

The next lemma gives a quantitative version of the approximation D^{x) ~ Cg- 

Lemma 5.4. (Noise estimates) 

Let Assumptions 2.1 hold. For any pair of indices i,j > 1, the martingale difference term T^{x,^) 
satisfies 

\{^uD\x)^j)s - {ipuCs^j)s\ < 5^ ■ {1 + \\x\\s) (5.9) 
\TTacew{D\x)) - Trace^.(C,)| < (55.(1 + \\x\\l). (5.10) 

Proof. See Appendix A. □ 

5.4. A Priori Bound. Now we have all the ingredients for the proof of the a priori bound 
presented in Lemma 3.8 which states that the rescaled process given by Equation (1.7) does not 
blow up in finite time. 

Proof Lemma 3.8. Without loss of generality, assume that p = 2n for some potitive integer 
n > 1. We now prove that there exist constants ai,a2,«3 > satisfying 

E[||x'='if ] < (ai + a2A;5)e"«'='^. (5.11) 

Lemma 3.8 is a straightforward consequence of Equation 5.11 since this implies that 



5 ^ E[||2;*^''^||f ] < S ^ (ai +a2A;<5)e"3fc5 ^ f [ai + a2 t) 

kS<T kS<T 



* < oo. 



For notational convenience, let us define V''^^ = E . To prove Equation (5.11), it suffices 

to establish that 

yk+1,5 _ yk,5 <CK6-{1 + y^'^) , (5.12) 

where K > is constant independent from 6 G (0, |). Indeed, iterating inequality (5.12) leads to 
the bound (5.11), for some computable constants ai, 02, 03 > 0. The definition of V'' shows that 

yk+l,S _ yk,S ^ ^ _ ^M)||2n _ ||^fc,5||2n] (5_13) 

where the increment x^^^'^ — x^'^ is given by 

^k+1,5 _ ^k,5 ^ ^k,s ^^-^ _ 25)5 - l^x'''^ + V2d -/'"'^C''- (5.14) 
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To bound the right-hand-side of Equation (5.13), we use a binomial expansion and control each 
term. To this end, we establish the following estimate: for all integers i,j, k >0 satisfying 

i + j + k = n and {i,j,k) / (n, 0, 0) 

the following inequality holds. 



E 



l-^ lis 



< + (5.15) 



To prove Equation (5.15), we separate two different cases. 

• Let us suppose {i,j,k) = (n — 1,0,1). Lemma 3.6 states that the approximate drift has a 



linearly bounded growth so that 



Consequently, we have 



S \\d%x' 



< 5(1+ lb' 



fc,(5| 



E 



„fe,<5||2 



X 



n-1 



{x^'^x' 



< E 



X 



M||2(n-1) ||^M||^ (1 + 



<5{l + 

This proves Equation (5.15) in the case (z,j, k) = (n — 1,0, 1). 

Let us suppose {i,j,k) |(n, 0,0), (n — 1,0, 1)|. Because for any integer p > 1, 

^<5^ (i + blU) 



E, 



I _ ^k,5 lip 



it follows from Cauchy-Schwarz inequality that 



E 



Since we have supposed that (i, j, k) |(n, 0,0), (n — 1, 0, 1)| and i + j + k = n, it follows 
that J + I > 1. This concludes the proof of Equation (5.15), 
The binomial expansion of Equation (5.13) and the bound (5.15) show that 

yk+l,S_yk,S < (^(l + yfc.<5). 

This proves Equation (5.12), which concludes the proof of Lemma 3.8. □ 

5.5. Invariance Principle. Combining the noise estimates of Lemma 5.4 and the a priori bound 
of Lemma 3.8, we show that under Assumptions 2.1 the sequence of rescaled noise processes defined 
in Equation 3.9 converges weakly to a Brownian motion. This is the content of Lemma 3.7 whose 
proof is now presented. 

Proof of Lemma 3.7. As described in [Ber86] [Proposition 5.1], in order to prove that 
converges weakly to W in C{[0,T];T-L^) it suffices to prove that for any t € [0,T] and any pair of 
indices i,j > the following three limits hold in probability, 
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(5.16) 
(5.17) 
(5.18) 



lim^^o 5 J]]E[||r'='''||2 = t-TVace^^.(C, 



k5<t 



k5<t 



k5<T 

We now check that these three conditions are indeed satisfied 
• Condition (5.16): since E 



t {ifi, Cs(Pj)s 

= Ve > 0. 



|pfc,<5||2 |„fc,(5 
I-*- lis I-'' 



E 



|pfc,<5||2 <k,5 



Trace-US [D^(x^'^)), Lemma 5.4 shows that 



where the error term e'^ satisfies |ef (x)| ^ (l + ||x||^). Consequently, to prove condition 
(5.16) it suffices to establish that 



limE[|<5 5]_e?(x'=" 



k5<T 



0. 



We have E[\6 T^kSKT^^ii^'"'^)]] ^ ^^{^ ' ^[T>k5<T (1 + W^'^'^E) } and the apriori bound 



presented in Lemma 3.8 shows that 



sup i^-IE 2^ (l + Wx" 



;[E (i + ii-'-'ii?)]} 



< oo. 



Consequently lim5_^o ^[1*^ X]fc5<r ~ 0' conclusion follows. 

Condition (5.17): Lemma 5.4 states that 



Efc 



^k,5 



S/^k,S\ 



where the error term eg satisfies |e2(x)| ^ 6» (1 + ||a:;||s)- The exact same approach as the 
proof of Condition (5.16) gives the conclusion. 
• Condition (5.18): from Cauchy-Schwarz and Markov's inequalities it follows that 



E 



I 1 1 2 IT 

|i L -I^{|ir''-.*|i2>5-ie} 



< E 



< E 



^k,5\\4 



■.fc,5ll4 



I f E[||rM||^] -(| 



< ^ <5^ • E 



Consequently we have 

E 



^k,5\\A 



1 



k5<T kS<T 
and the conclusion again follows from the a priori bound Lemma 3.8. 
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6. Quadratic Variation. As discussed in the introduction, the SPDE (1.6), and the MetropoUs- 
Hastings algorithm pCN which approximates it for smaU 5, do not satisfy the smoothing property 
and so almost sure properties of the limit measure vr'^ are not necessarily seen at finite time. In 
this section we prove a limit theorem satisfied by such almost sure quantities, under pCN. When 
C is the covariance of Brownian motion or Brownian bridge then the almost sure quantity will be 
quadratic variation; for other covariances it will be a generalization defined precisely in the following 
subsection. We show that the quadratic variation of pCN converges as /c — )• oo to its value under 
the invariant measure. We then prove that piecewise linear interpolation of this quantity solves, 
in the small 6 limit, a linear ODE (the "fiuid limit") whose globally attractive stable state is the 
almost sure quantity. This quantifies the manner in which the pCN method approaches statistical 
equilibrium. 

6.1. Definition and Properties. Under Assumptions 2.1, the Karhunen-Loeve expansion shows 
that TTQ-almost every x £ 71 satisfies 

This motivates the definition of the quadratic variation like quantities 

" / \2 / \2 

V. (x) = hm inf V and V+ (x) = hm sup A^-^ V . 

When these two quantities are equal the vector x £ T-L is said to possess a quadratic variation V{x) 
defined as V{x) = V-{x) = V+{x). Consequently, vro-almost every x £ 7i possesses a quadratic 
variation V{x) = 1. It is a straightforward consequence that vrQ-almost every and vr "^-almost every 
X G Ti possesses a quadratic variation V{x) = r. Strictly speaking this only coincides with quadratic 
variation when C is the covariance of a (possibly conditioned) Brownian motion; however we use 
the terminology more generally in this section. The next lemma proves that the quadratic variation 
V{-) behaves as it should do with respect to additivity. 

Lemma 6.1. (Quadratic Variation Additivity) 

T> 

Consider a vector x £% and a Gaussian random variable ^ ~ ttq and a real number a G M. Suppose 
that the vector x £ Ti possesses a finite quadratic variation V{x) < +oo. Then almost surely the 
vector X + £7i possesses a quadratic variation that is equal to 

V{x + aO = V{x) + a'^. 



Proof. Let us define Vn = N'^ ^"'^^'ff'''^^ To prove Lemma 6.1 it suffices to prove that 

j 

almost surely the following limit holds 



lim Vn = 0. 



Borel-Cantelli Lemma shows that it suffices to prove that for every fixed e > we have X^^>i IP [| Vat | > 
el < oo. Notice then that Vn is a centred Gaussian random variables with variance 



Var(y^) = -(A-ij; 

1 ' '3 

It readily follows that J2n>i ^[\^n\ > e] < oo, finishing the proof of the Lemma. □ 
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6.2. Large k Behaviour of Quadratic Variation for pCN. The pCN algorithm at temperature 
r > and discretization parameter (5 > proposes a move from x to y according to the dynamics 

y = {l-26)^x + {26T)H with ^ ~ ttq. 

This move is accepted with probability a^{x, y). In this case, Lemma 6.1 shows that if the quadratic 
variation V{x) exists then the quadratic variation of the proposed move y & Ti exists and satisfies 

Consequently, one can prove that for any finite time step 5 > and temperature r > the quadratic 
variation of the MCMC algorithm converges to r. 

Proposition 6.2. (Limiting Quadratic Variation) Let Assumptions 2.1 hold and {x^'^}k>f) 
he the Markov chain of section 3.1. Then almost surely the quadratic variation of the Markov chain 
converges to t, 

lim ^(x*^'^) = r. 

Proof. Let us first show that the number of accepted moves is infinite. If this were not the case, 
the Markov chain would eventually reach a position x^'^ = x such that all subsequent proposals 
yk+i — (^-y _ 2(5)2 + (2r5)2 ^'^+' would be refused. This means that the i.i.d. Bernoulli random 
variables "y^^^ = Bernoulli (a*^ (x'^, y'^ "*"')) satisfy ^'^+' = for all / > 0. This can only happen with 
probability 0. Indeed, since ^[7'^"'"' = 1] > 0, one can use Borel-Cantelli Lemma to show that almost 
surely there exists I > such that 7*^+' = 1. To conclude the proof of the Proposition, notice then 
that the sequence {uk}k>o defined by Uk+i — Uk = —26{uk — t) converges to r. □ 

6.3. Fluid Limit for Quadratic Variation of pCN. To gain further insight into the rate at which 
the limiting behaviour of the quadratic variation is observed for pCN we derive an ODE "fluid 
limit" for the Metropolis-Hastings algorithm. We introduce the continuous time process 
defined as continuous piecewise linear interpolation of the the process k 1— V{x'''^) as follows: 

vHt) = ]{t-tk)V{x''+''^) + ^{tk+^-t)V{x'''^) for tk<t<tk+i. (6.2) 

Since the acceptance probability of pCN approaches 1 as 5 — (see Corollary 5.2) Equation (6.1) 
shows heuristically that the trajectories of of the process t 1— )■ v^{t) should be well approximated 
by the solution of the (non stochastic) differential equation 

v = -2{v-t) (6.3) 

We prove such a result, in the sense of convergence in probability in C([0,T];]R): 

Theorem 6.3. (Theorem 3: Fluid Limit For Quadratic Variation) Let Assumptions 2.1 
hold. Let the Markov chain x^ start at fixed position x^ € %^ . Assume that x^ ^H. possesses a finite 
quadratic variation, V{x^) < 00. Then the function v^{t) converges in probability in C([0,r],M), 
as 6 goes to 0, to the solution of the differential equation (6.3) with initial condition vq = V{x^:). 
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As already indicated, the heart of the proof of the result consists in showing that the acceptance 
probability of the algorithm converges to 1 as goes to 0. We prove such a result as Lemma 6.4 
below, and then proceed to prove Theorem 6.3. To this end we introduce t^{k), the number of 
accepted moves: 

l<k 

where 7''^ = Bernoulli(a''(a;, y)) is the Bernoulli random variable defined in Equation (3.4). Since 
the acceptance probability of the algorithm converges to 1 as 5 — )• 0, the approximation t^{k) w k 
holds. In order to prove a fluid limit result on the interval [0, T] one needs to prove that the 
quantity |t'^(fc) — k\ is small when compared to 6~^. The next Lemma shows that such a bounds 
holds uniformly on the interval [0,T]. 

Lemma 6.4. (Number of Accepted Moves) Let Assumptions 2.1 hold. The number of ac- 
cepted moves t^(-) verifies 

lim sup {6 • \t\k) - k\ : < k < T6-^} = 
where the convergence holds in probability. 

Proof. The proof is given in Appendix B. □ 
We now complete the proof of Theorem 6.3 using the key Lemma 6.4. 

Proof of Theorem 6.3. The proof consists in proving that the trajectory of the quadratic 
variation process behaves as if all the move were accepted. The main ingredient is the uniform 
lower bound on the acceptance probability given by Lemma 6.4. 

Recall that v^{k6) = V{x^'^). Consider the piecewise linear function v^{-) G C([0,T],]R) defined 
by linear interpolation of the values v^{k6) = u^{k) and where the sequence {u^{k)}k>o satisfies 
u\{)) = V{x^) and 

u\k + l)-u\k) = -25{u^{k) -t). 

The value u^{k) G M represents the quadratic variation of x^'^ if the k first moves of the MCMC 
algorithm had been accepted. One can readily check that as 6 goes to zero the sequence of contin- 
uous functions v^{-) converges in C([0,T],M) to the solution v{-) of the differential equation (6.3). 
Consequently, to prove Theorem 6.3 it suffices to show that for any e > we have 



lim 



sup||y(x^''^) -n^(A;)| : < ^^rj > e = 0. (6.4) 

The definition of the number of accepted moves t^{k) is such that V{x''^^) = u^t^k)). Note that 
(6.5) u\k) = (1 - 25)''uo + (1 - (1 - 25)'')t. 

Hence, for any integers ti,t2 > 0, we have \u^{t2) — u^{ti)\ < \u^{\t2 — ti|) — 'u'^(0)| so that 
\V{x'''^)-u^{k)\ = \u^{t\k)) -u^{k)\ < \u^{k-t\k))-u^{0)\. 
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Equation (6.5) shows that \u^{k) - u^{0)\ < (l - (1 - 25)^). This imphes that 

\V{x'''^)-u\k)\ < 1 - (1 - 25)^-*'('=) < 1 - (1 -25)"^"'^ 

where S = sup {6 • \t\k) - k\ : < k < T5-^}. Since for any a > we have 1 - (1 - 25)"'^"' 
1 — e~'^"', Equation (6.4) fohows if one can prove that as 6 goes to the supremum S converges to 
in probabihty: this is precisely the content of Lemma 6.4. This concludes the proof of Theorem 
6.3. □ 

7. Numerical Results. The numerical results comparing the efficiency of the RWM and pCN 

are presented in a companion paper [CRSWll] and we refer the interested reader to that paper. 
In this section, we present some numerical simulations demonstrating our results in the context of 
simulated annealing. We consider the minimisation of a functional J(-) defined on the Sobolev space 
H^{M.) C C°([0, 1]) C L2(0, 1). Functions x € H^{[0, 1]) are continuous and satisfy x(0) = x(l) = 0. 
For a given real parameter A > 0, the functional J : Hq{[0, 1]) — ?• M is composed of two competitive 
terms, as follows: 



The first term penalises functions that deviate from being flat, whilst the second term penalises 
functions that deviate from one in absolute value. Critical points of the functional J(-) solve the 
following Euler-Lagrange equation: 



Clearly x = is a solution for all A G If A G (0,7r^) then this is the unique solution of the 
Euler-Lagrange equation and is the global minimizer of J. For each integer k there is a supercritical 
bifurcation at parameter value A = /c^vr^. For A > vr^ there are two minimizers, both of one sign 
and one being minus the other. The three different solutions of (7.2) which exist for A = 27r^ 
are displayed in Figure 1, at which value the zero (blue dotted) solution is a saddle point, and 
the two green solutions are the global minimizers of J. These properties of J are overviewed in, 
for example, [HenSl]. We will show how these global minimizers can emerge from an algorithm 
whose only ingredients are an ability to evaluate ^ and to sample from the Gaussian measure 
with Cameron-Martin norm \x{s)\'^ds. We emphasize that we are not advocating this as the 
optimal method for solving the Euler-Lagrange equations (7.2). We have chosen this example for 
its simplicity, in order to illustrate the key ingredients of the theory developed in this paper. 

The pCN algorithm to minimize J given by (7.1) is implemented on i^([0, 1]). Recall from 
section 1 that the Gaussian measure N(0, C) may be identified by finding the covariance operator 
for which the H^{[0,1]) norm ||x||^ — Jq |2;(s)| ds is the Cameron-Martin norm. In [HAVW05] it 
is shown that the Wiener bridge measure Wo-5.0 on L'^{[0, 1]) has precisely this Cameron-Martin 
norm; indeed it is demonstrated that C^^ is the densely defined operator — ^ with D{C^^) = 
ii"^([0, 1]) n -f^o([0, 1]). In this regard it is also instructive to adopt the physicists viewpoint that 




(7.1) 



x + Xx{l-x^) = 
x(0) = x(l) = 0. 



(7.2) 
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Euler-Lagrange 




Fig 1. The three solutions of the Euler-Lagrange Equation (7.2) for \ = 2-k . Only the two non-zero solutions are 
global minimum of the functional J(-). The dotted solution ts a local maximum of J(-)- 



although, of course, there is no Lebesgue measure in infinite dimensions. Using an integration by 
parts, together with the boundary conditions on Hq{[Q, 1]), then gives 



WQ^Q{dx) oc exp 



(P'X 

x(s)-^(s) ds 



dx 



and the inverse of C is clearly identified as the differential operator above. See [CH06] for basic 
discussion of the physicists viewpoint on Wiener measure. For a given temperature parameter r 
the Wiener bridge measure Wq^q is defined as the law of {\/^^(*)}fg[o i] where 

{VK(t)}tg[o,i] is a standard Brownian bridge on [0, 1] drawn from Wo-s>o- 

The posterior distribution Tr'^{dx) is defined by the change of probability formula 



dn'' 



dm, 



-{x) oc 



0^0 



A 

with ^'(x) = -. 

4 Jo 



X[S 



ifds. 



Notice that 7r^{H^{[0,l)) = ir'^ {H^{[0,1)) = since a Brownian bridge is almost surely not dif- 
ferentiable anywhere on [0, 1]. It is for this reason that the algorithm is implemented on L^([0, 1]) 
even though the functional J(-) is defined on the Sobolev space Hq{[0, 1]). In terms of Assumptions 
2.1(1) we have k = 1 and the measure vrj is supported on T-L^' if and only if r < ^, see Remark 2.2; 
note also that -f^o([0, 1]) = T-L^. Assumption 2.1(2) is satisfied for any choice s E [\, 5) because H'' 
is embedded into L'^{[0, 1]) for s > |. We add here that Assumptions 2.1(3-4) do not hold globally, 
but only locally on bounded sets, but the numerical results below will indicate that the theory 
developed in this paper is still relevant and could be extended to nonlocal versions of Assumptions 
2.1(3-4), with considerable further work. 

Following section 3.1, the pCN Markov chain at temperature r > and time discretization 5 > 
proposes moves from x to y according to 



y = {l-26)^x + (25r)5^ 
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where ^ S C'llO, is a standard Brownian bridge on [0, 1]. The move x — )• y is accepted with 

probabihty a^{x,^) = 1 A exp ( — T~^[\I'(y) — ^'(x)]). Figure 2 displays the convergence of the 
Markov chain {x^'^}k>(} to a minimiser of the functional J(-)- Note that this convergence is not 
shown with respect to the space i^o([0, 1]) on which J is defined, but rather in L^([0, 1]); indeed 
J(-) is almost surely infinite when evaluated at samples of the pCN algorithm, precisely because 
ttq (//q([0, 1)) = 0, as discussed above. 



error 

0.7| 1 1 




'o 2000 4000 6000 8000 10000 

algorithmic time 



Fig 2. pCN parameters: X = 27r^, S = 1.10 ^, t = 1.10 ^. The algorithm is started at the zero function, x°-\t) = 
for t G [0, 1]. After a transient phase, the algorithm fluctuates around a global mimmiser of functional J{-). The l? 
error \^x^'^ — (minimiser) 11^2 is plotted as a function of the algorithmic time k. 

Of course the algorithm does not converge exactly to a minimiser of J(-), but fluctuates in a 
neighbourhood of it. As described in the introduction of this article, in a finite dimensional setting 
the target probability distribution vr"^ has Lebesgue density proportional to exp ( — r^^ J{^))- This 
intuitively shows that the size of the fluctuations around the minimum of the functional J(-) are of 
size proportional to y/r. Figure 3 shows this phenomenon on log- log scales: the asymptotic mean 
error E[||x — (minimiser) || 2] is displayed as a function of the temperature r. Figure 4 illustrates 
Theorem 6.3. One can observe the path {v^{t)}t^[Q^x] for a finite time step discretization parameter 
6 as well as the limiting path {w(t)}tg[o,T] that is solution of the differential equation (6.3). 

8. Conclusion. In the following we briefly summarize our results, their implications, the the- 
oretical tools employed and scope for future work. 

• The paper is organized around the Optimal Proposal Design Principle that proposals 
which are well-defined on the infinite dimensional parameter space results in MCMC methods 
which do not suffer from the curse of dimensionality. Our emphasis is on the general setting 
of Gaussian random field prior distributions and associated problems in Bayesian nonpara- 
metrics. 

• We exhibit an example which highlights the importance of the optimal design of proposals 
in the context of random walk methods. We show that modifying the mean of the standard 
random walk proposal (RWM), to obtain the preconditioned Crank- Nicolson walk proposal 
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Mean Error 




-2.0- 



0.5 1.0 1.5 2.0 2.5 3.0 

-loglo(T) 

Fig 3. 

Mean error E[||2; — (minimiser)||2] as a function of the temperature r. 

(pCN), results in an order of magnitude improvement in computational complexity when 
applied to problems with Gaussian random field prior. In contrast, previous theories have 
concentrated on optimal scaling of the proposal variance [RGG97] and this has, in comparison, 
a negligible effect on complexity. Theorems 1 and 2 in the introduction show that when 
discretizing nonparametric problems to dimensions the pCN method achieves a similar 
error to RWM with computational complexity reduced by a factor of order 0{N). 

• More generally, we emphasize the fact that designing MCMC algorithms for nonparametric 
problems needs more study since it is extremely useful for applied researchers to have guide- 
lines regarding implementation of algorithms. For nonparametric problems, construction of 
prior distributions and likelihood modeling must be integrated with designing proposals and 
vice versa. This synthesis will have a beneficial impact on both the statistical inference and 
the efficiency of the algorithm. In this regard, we also advocate that in Bayesian nonparamet- 
ric problems, the primary focus should be on optimal design of algorithms (which inherit the 
structure of prior and the likelihood) as compared to optimal scaling. A review of a wide range 
of algorithms which do this for Gaussian random field priors may be found in [CRSWll]. 

• Our basic tool of analysis is the construction of diffusion limits for the MCMC method and 
use of ergodicity of the limit process to quantify the computational cost of obtaining sample 
path averages which are close to the desired expectation. To prove these diffusion limits we 
have further developed the methods from [MPSll], which apply to the RWM, so that they 
may be used to study the pCN method. The diffusion limit results obtained in this paper 
use a proof technique analogous to the proof of diffusion limits of Markov chains in finite 
dimensions. We believe therefore that the methods of analysis that we introduce may be used 
to understand other MCMC algorithms, based on local proposals, and other nonparametric 
problems. The approach is encapsulated in Lemma 3.5 which is structured in such a way that 
it, or variants of it, may be used to prove diffusion limits for a variety of problems, especially 
when there is an underlying Gaussian structure. 
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Quadratic Variation 




Rescaled Time 



Fig 4. pCN parameters: X = 2n^, r = 1.10~^, 5 = 1.10"'^ and the algorithm starts at x'^'^ — 0. The rescaled 
quadratic variation process (full line) behaves as the solution of the differential equation (dotted line), as predicted by 
Theorem 6.3. The quadratic variation converges to r , as described by Proposition 6.2. 



• The recently developed analysis of 1-Wasserstein spectral gaps for MCMC in high dimensions 
[HSVll] shows great promise as a potential tool for the study of a wide range of MCMC 
methods developed to adhere to the design principle studied in this paper, even outside the 
class of local proposal MCMC methods. This is because the methods are well-adapted to the 
high dimensional limit, being based on study of infinite dimensional problems. In contrast, 
the methods of [MT93], which have been widely adopted by statisticians, do not scale so well 
with respect to dimension since they fail completely in many infinite dimensional settings. 

• There are a host of tools known outside the statistics literature which are extremely useful 
for obtaining guidelines for implementation of algorithms. For instance, the pCN adheres 
to the "optimize then discretize" perspective where as the RWM sticks to the "discretize 
then optimize" view point. In numerical analysis, it is known that in some problems, the 
former performs better than the latter (see [HPUU08], Chapter 3). We illustrate this point 
in statistical examples. 

• We have demonstrated a class of algorithms to minimize the functional J given by (1.12). The 
Assumptions 2.1 encode the intuition that the quadratic part of J dominates. Under these 
assumptions we study the properties of an algorithm which requires only the evaluation of ^ 
and the ability to draw samples from Gaussian measures with (Cameron-Martin) norm given 
by the quadratic part of J. We demonstrate that, in a certain parameter limit, the algorithm 
behaves like a noisy gradient flow for the functional J and that, furthermore, the size of 
the noise can be controlled systematically. Thus we have constructed a simulated annealing 
algorithm on Hilbert space, and connected this to a diffusion process (SDE), a connection 
made in finite dimensions in [GH86]. The applications we are interested is mainly on Bayesian 
nonparametrics; but of course, there are many more problems to which our techniques and 
results are applicable, e.g., optimization, control theory, etc. 

• The diffusion limits that we prove may not hold for discrete priors such as the Dirichlet pro- 
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cesses, or more generally Levy random field priors [WCTll] because there is no underlying 
Gaussian measure. Likewise there is a growing interest for inverse problems in Besov priors 
based on wavelet bases and non-Gaussian random coefficients [SS^09]. Other Markov process 
limits are to be expected instead of a diffusion process. These class of models are very impor- 
tant in applications used in varied fields including computer science, machine learning and 
medical imaging. Thus it is of great importance to understand and quantify their efficiency; 
to achieve this will need new techniques and will be pursued elsewhere. 



APPENDIX A: PROOFS OF LEMMAS; SECTION 4 
Proof of Lemma 5.1. Let us introduce the two 1-Lipschitz functions /i, : 



defined 



by 



h{x) = lAe^ and = 1 + a; l|^<o}- 

The function h^: is a first order approximation of /i in a neighbourhood of zero and we have 

f26 



a^{x,0 = /i(-^{^(y)-^'(x)}) and a^{x,0 = h. 



{V^{x),0 



where the proposal y is a function of x and ^, as described in Equation (3.1). Since is close 
to h{-) in a neighbourhood of zero, the proof is finished once it is proved that — ^{^'(y) — ^'(x)} is 

close to -yjf (V^'(x),0- We have \a^{x,0 

— a^{x, ^)\^\ ^ ^1 + A2 where the quantities Ai 

and A2 are given by 



A2 = 



\h{--{^{y)-^{x)}) - h{ 
{V^{x),0) - K{ 



126 

T 
T 



(v*(x),e»r 

(vM/(x),o)r 



By Lemma 2.4, the first order Taylor approximation of ^' is controled, — ^'(x) — (V^'(x), y ■ 

^)| ^ \\y ~ ^lls- The definition of the proposal y given in Equation (3.1) shows that ||(y — x) ■ 
\/25t^\\s ^ (^llxlls. Assumptions 2.1 state that for z G Ti'^ we have {V^{x),z) < (l + \\x\\s) ■ \\z\\ 
Since the function /i(-) is 1-Lipschitz it follows that 



Ai=E^ 



\h{--{^{y)-^{x)})-h{ 



(V^(x),0)| 



(A.2) 



l^'(y) -^(x) - (V^'(x),y-x)P + |(V^'(x), y - x - \/2^0 P 



|y - X 



|2p 



+ (l + ||x||P) -(5 11x1 



< 5^(l + ||x 



|2p^ 



Lemma 3.1 has been used to control the size of Ex [\\y — x||^] . To bound A2, notice that for z € 
we have \h{z) — h^:{z)\ < \z'^. Therefore the quantity A2 can be bounded by 



A2<E, 



|^/5(VM/(x),0l'^' 



< S^E. 



{^ + M\f)m\f < s^{i + \\x\\f). 



Estimates (A.2) and (A. 3) together give Equation (5.2). 
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(A.3) 

□ 



Proof of Corollary 5.2. Let us prove Equations (5.3) and (5.4). 

• Lemma 5.1 and Jensen's inequality give Equation (5.3). 

• To prove (5.4), one can suppose (52||x||f < 1. Indeed, if > 1, we have 



\a\x,0 



< 1 < <52||x||P < (52 {l + \\x\\P), 



p 

which gives the result. We thus suppose from now on that (52||x||s < 1. Under Assumptions 
2.1 we have ||V^(a;)||_s ^ 1 + \\x\\s- Lemma 2.4 shows that for all x,y we have |^'(y) — 
^{x) - (V^'(x),7/ - x)\ < \\y - The function h{x) = 1 A is 1-Lipschitz, a^{x,^) = 
h[ - \ [^{y) - ^(x)]) and /i(0) = 1. Consequently, 



E, 



\a\x,i) 



\h[--[^{y)-^{x)]) - h{f))\' 



< 



¥.,[\^{y)--^{x)\P] < E4\{Vnx),y-x)\P + \\y-x\\f] 



<{l+\\x\\P)-EJ\\y-x\\P]+EJ\\y-x 



|2P1 



By Lemma 3.1, for any integer /3 > 1 we have E2;[||y — x\\s] ^ (J'^HxUs + 6 2 so that the 
assumption (52||x||f < 1 leads to 



E. 



\a^(x) 



<{l + \\xE).i5P\\xE + 6^ + {5'P\\x\\f + 5n 
<{l + \\x\\P)-{62 +6"^) + {6P + 6P) 
<<5' (l + lkllD- 



This finishes the proof of Corollary 5.2. 



□ 



Proof of Lemma 5.4. The martingale difference r'^(x,i^) defined in Equation (3.8) can also 
be expressed as 

T'{x,C)=^ + F{x,0 
where the error term F(a;, ^) = Fi{x,^) + F2{x,^) is given by 

Fi(x,C) = (2r<5)-i((l-25)i-l) {j'{x,C) - E,[j'{x,0])x 



We now prove that the quanity F(x,^) satisfies 



E, 



\\Fix,Ofs <S-^ il + M's) 



(A.4) 



We have 6 2 ((1 — 25)2 — l) < ^2 and |7'^(x,^)| < 1. Consequently, 



E^ 



\Fiix,ml 



< S 11x11 



(A.5) 
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Let us now prove that F2 satisfies 



\\F2{x,ml <s-^ a + \\xp)- 



(A.6) 



To this end, use the decomposition 



\\F2ix,ml 



h + h 



Cauchy-Schwarz inequality shows that /i < E^^ |7'^(a;, ^) — 1| 



where the Bernouhi random 



variable 7'^(x,^) can be expressed as 7'^(x,^) = ^{u<a^(x,£,)} where U ~ Uniform(0, 1) is 
independent from any other source of randomness. Consequently 



E, 



l7'(^,0-l| 



where the mean local acceptance probability a^[x) is defined by a^{x) = ¥.x[oi^ {x , € [0, 1]. 
The convexity of the function x — )• |1 — x| ensures that 

|l-a'^(x)| = |l-E4a^(x,e)]| < ^x[\l - a\x , < 55 (1 + ||x||) 

where the last inequality follows from Corollary 5.2. This proves that Ii ^ 5* (l + ||x||2). To 
bound 12, it suffices to notice 

h= iiE47'(x,e)-e]ii? = iiE4(7^(x,e)-i)-c]ii? 

<E.[|7^(x,0-lMlell^] = h 

so that /2 </i < (1 + lk||5) andE,[||F2(x,C)||2] < 6^ {l + \\x\\h). 
Combining Equation (A. 5) and (A.6) gives Equation (A. 4). 

Let us now describe how Equations (5.7) and (5.8) follow from the estimate (A. 4). 

• WehaveE[((^i,,^)s((^j,^)s] = {(pi,Cs (pj) s w.d¥.a;[{(pi,T\x,i)) s{(pj,T\x,i)) s] = {ipi,D\x) (pj), 
with r''(j;,,^) = ^ + F{x,S,)- Consequently, 

{ipi,D^{x)ipj)s - {ip^,Cs ipj)s = E,[{ipi,F{x,O)s{0j,F{x,O)s] 

+ E4{ipi,0s{iPj,F{x,0)s] 
+ Ex[{^i,F{x,C))s{ipj,Os]- 
We have F{x,^))s\ < ||F(x,^)||s and Cauchy Schwarz's inequality proves that 

Ex[{^i,F{x,O)s{0j,Os? < E,[||F(x,C)||s m\s? 

<E,[\\F{x,Ofs]- 

It thus follows from Equation (A. 4) that 

\{0,,d\x)^,)s - (<^„c,(^,),|<e4||f(x,0II']+ie4II^(^'^)II']' 

< -5^(1 + 11^11.), 

finishing the proof of (5.7). 
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• We have TTacen-{Cs) = HUE] and Tracen^ {D^ (x)) = E[\\T^{x,OE]- Estimate (A.4) thus 
shows that 

\TvaceH^{D\x)) - Trace^^ (C.) | = |E[||r^(x, ^ ||^ - ||ell?]| = iHU + H^,ml-Ufs]\ 

< |E[(2e + Fix,0,F{x,C))s\ < E[||2e + \\Fix,ms] 

<E[m\U\\F{x,o\\f-nmx,ofs]'' 

< (l + 5i (1 + ^ . (5i (1 + ||x||,))) < 5l{l + \\x\\l), 

finishing the proof of (5.8). 

□ 

APPENDIX B: PROOF OF LEMMA 5.4 

Before proceeding to give the proof, let us give a brief proof sketch. The proof of Lemma 6.4 
consists in showing first that for any e > one can find a ball of radius -R(e) around in T-L^, 

Bo{R{e)) = {x e : \\x\\s<R{e)}, 

such that with probability 1 -2e we have x^'"^ G Bo{R{e)) and y'''^ £ Bo{Rie)) for all < A; < Td'^. 
As is described below, the existence of such a ball follows from the bound 

E[ sup \\x{t)\\,] < +00 (B.l) 
te[o,T] 

where t i— )■ x{t) is the solution of the stochastic differential equation (3.6). For the sake of com- 
pleteness, we include a proof of Equation (B.l). The solution 1 1— )• x{t) of the stochastic differential 
equation (3.6) satisfies x{t) = Jq d[x{u)) du + \/2TW{t) for all t S [0,r] where the drift function 
d{x) = — (x + CV^'(x)) is globally Lipschitz on as described in Lemma 2.4. Consequently 
M(2;)IU — + ll^lls) ^or some positive constant A > 0. The triangle inequality then shows that 

(B.2) ||x(t)||, {l + \\x{u)\\s)du + V2^\\W{t)\\s. 

Jo 

By Gronwall's inequality we obtain 

(B.3) sup||x(t)||, < {AT + sup \\W{t)\\,) [l + ATe^^]. 

[0,T] [0,T] 

Since E[sup[Q'p] l|W^(^)IU] < c«, the bound (B.l) is proved. 

Proof of Lemma 6.4. The proof consists in showing that the the acceptance probability of 
the algorithm is sufficiently close to 1 so that approximation t^{k) ~ k holds. The argument can be 
divided into 3 main steps. In the first part, we show that we can find a finite ball -6(0, R{£)) such 
that the trajectory of the Markov chain {x^'^}i:<ts-^ remains in this ball with probability at least 
1 — 2e. This observation is useful since the function ^ is Lipschitz on any ball of finite radius in 
Ti'^. In the second part, using the fact that ^ is Lipschitz on -6(0, -R(e)), we find a lower bound for 
the acceptance probability a^. Then, in the last step, we use a moment estimate to prove that one 
can make the lower bound uniform on the interval < k < T5^^. 
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Restriction to a Ball of Finite Radius 

First, we show that with high probability the trajectory of the MCMC algorithm stays in a 
ball of finite radius. The functional x i— )■ sup^g^.T] is continuous on C{[0,T],'Hs) and 

E[supjg[o,T] < oo for t I— )• x{t) following the stochastic differential equation (3.6), as 

proved in Equation (B.l). Consequently, the weak convergence of to the solution of (3.6) 
encapsulated in Theorem 3.2 shows that E[sup;j<y5-i can be bounded by a finite 

universal constant independent from 5. Given e > 0, Markov inequality thus shows that one 
can find a radius Ri = i?i(e) large enough so that the inequality 

P[||x'''^||, < Ri for ah < A; < TS^'^] > 1 - e (B.4) 
for any 6 G (0, By Fernique's Theorem there exists a > such that E[e""^"^] < oo. This 

2 

implies that P[||^||s > r] < e~"^ . Therefore, if {^fc}fc>o are i.i.d. Gaussian random variables 
x> 

distributed as ~ ttq, the union bound shows that 

P[||\/5^fclU <r for all < A; < TJ^^] > 1 - TS'^ exp(-a(5- V^). 

This proves that one can choose R2 = -R2(e) large enough in such a manner that 

^iW^/SCkh < R2 for all 0<k<T6-^]>l-e (B.5) 

for any 6 G (0, ^). At temperature r > the MCMC proposals are given by y'''^ = (1 — 
25)^ x'^'^ _l_ (25t)5 ^i^. It thus follows from the bounds (B.4) and (B.5) that with probability at 
least (1 — 2e) the vectors x^'*^ and y^''' belong to the ball Bo{R{e)) = {x e Us '■ \\x\\s < R{^)} 
for < A; < T6~^ where radius ii(e) is given by R{e) = -Ri(e) + i?2(e)- 
Lower Bound for Acceptance Probability 

We now give a lower bound for the acceptance probability a^{x^'^,£^^) that the move x'^'^ — )• 
y^'^ is accepted. Assumptions 2.1 state that ||V\l'(x)||_5 ^ 1 + II^^Hs- Therefore, the function 
^' : ■H'' ^ M is Lipschitz on Bo{R{e)), 

ll^rlliip,, supj l'^y ~y : x,ye Bo{R{e))} < 00. 

One can thus bound the acceptance probability a^{x^'^, = lAexp {—t~^ [^[y^'^)—"^[y'''^)]^ 
for x^'^,y^'^ G Bo{R{e)). Since the function z 1— >• 1 A e""^ is Lipschitz with constant r~^, 
the definition of ||^'||iip,£ shows that the bound 



\-a\x^^\e)<r-'mwA\y^ 



X 



< ||^||iip,e {[(1 - 25)^ - 1] \\x'^''\\s + {25t)1 

<^5{i+\\e\\s) 

holds for every x'''^,y^'^ G Bo(R{e)). Hence, there exists a constant K = K{e) such that 
3'5(^fc) = 1 _ kV6{1 + IIC^IU) satisfies a^{x^^^,^'') > a^{^'') for every x^^^y'''^ G Bo{R{e)). 
Since the trajectory of the MCMC algorithm stays in the ball i?o(-R(e)) with probability at 
least 1 — 2e the inequality 

P[q''(x'='^C^') > a^(e'') for ah < A; < T^^^] > 1 - 2e. 

holds for every 5 G (0, ^). 
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Second Moment Method 

To prove that t^{k) does not deviate too much from k, we show that its expectation satisfies 
E[t\k)] ^ k and we then control the error by bounding the variance. Since the Bernoulh 
random variable -y^'^ = Bernoulli(a^(x'^''^^'^)) are not independent, the variance of t^{k) = 
Yli<k ^'''^ is easily computable. We thus introduce i.i.d. auxiliary random variables j^'^ 
such that 

Y^t' = t'{k) ^ t^(fc) = EV''- 

l<k l<k 

As described below, the behaviour of t^{k) is readily controlled since it is a sum of i.i.d. ran- 
dom variables. The proof then exploits the fact that t^{k) is a good approximation of t^{k). 

The Bernoulli random variables 7'^''^ can be described as 7'^''^ = ll(C/fc < a^i^x^^^e!')) where 
{Uk}k>o are i.i.d. random variables uniformly distributed on (0, 1). As a consequence, with 
probability at least 1 — 2e, the random variables 7^^'^ = ll(?7fc < a^) satisfy 7*^''^ > 7'^''^ for 
all < A; < T5~^. Therefore, with probability at least 1 — 2e, we have t^^^^ > P^'^) for all 
0<k< T6-^ where = Ez<fc7'''^- Consequently, since t^'^-^^ < k, to prove Lemma 6.4 it 
suffices to show instead that the following limit in probability holds, 

lim sup{6-\?{k)-k\ :0<k<Td-^} =0. (B.6) 

Contrary to the random variables {7'^'''}fc>0) the random variables {7^'''}fc>o a-re i.i.d. and are 
thus easily controlled. By Doob's inequality we have 



sup{<5- -E[?(A:)]| : < A: < T^^} > r] 



^ ^ Var(?(r^-^)) ^ ^6T_ 



Since E[?(/c)] = k ■ [l - {I + , Equation (B.6) follows. This finishes the proof 

of Lemma 6.4. 



□ 
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